HATLAS - a Fedora data project

Datanommer

About #

Datanommer is historical data from “Fedora Messaging”, the message bus (based on RabbitMQ) that powers our event-driven apps. It contains JSON messages sent by many internal and external systems, with the intended recipients being other apps in our ecosystem. These messages represent both user activity and automated system activity.


Status #

Last updated: 2025-10-08

For the Hatlas launch I only want to host one month of data, which is roughly 1.5 GB. I plan to backfill the rest in the near future, after validating infra costs and getting community approval. Watch this page or the news.

Schema #

Bronze #

messages hasMany users
messages hasMany packages

fedora.datanommer_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | long | True | |
| msg_id | string | True | Supposed to be a UUID, but messages early in Datanommer history are not conformant. |
| topic | string | True | |
| timestamp | timestamptz | True | |
| category | string | True | Used to facilitate Datagrepper queries. |
| agent_name | string | False | The name of the user; sometimes this is empty and the name exists only in msg. |
| source_name | string | True | System metadata. IMO this column should be dropped. |
| source_version | string | True | System metadata. IMO this column should be dropped. |
| msg | string | True | Topic-specific JSON. |
| headers | string | False | Topic-specific JSON. |
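
For illustration, a minimal PySpark sketch of querying this table, assuming it is exposed through a Spark catalog under the name above; the category value and the `$.owner` JSON path are placeholders, not guaranteed fields:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.getOrCreate()

messages = spark.table("fedora.datanommer_messages_bronze")

# Filter on category (which exists to facilitate Datagrepper-style queries)
# and a time window, then pull a field out of the raw JSON msg string by path.
recent = (
    messages
    .filter(col("category") == "copr")          # placeholder category
    .filter(col("timestamp") >= "2025-09-01")
    .withColumn("owner", get_json_object(col("msg"), "$.owner"))  # placeholder path
    .select("msg_id", "topic", "timestamp", "owner")
)

recent.show(10, truncate=False)
```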

fedora.datanommer_users_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | integer | True | |
| name | string | True | Username. |

fedora.datanommer_packages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | integer | True | |
| name | string | True | |

Join Tables #

fedora.datanommer_users_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| user_id | integer | True | |
| msg_id | string | True | |
| msg_timestamp | timestamptz | True | |

Note: Datagrepper performs this join on both msg_id and msg_timestamp.

fedora.datanommer_packages_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| package_id | integer | True | |
| msg_id | string | True | |
| msg_timestamp | timestamptz | True | |

Note: Datagrepper performs this join on both msg_id and msg_timestamp.
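
As a sketch of how the hasMany relationships resolve in practice, here is a hypothetical PySpark join from users to their messages through the join table, matching on both msg_id and msg_timestamp as Datagrepper does:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

users = spark.table("fedora.datanommer_users_bronze")
links = spark.table("fedora.datanommer_users_messages_bronze")
messages = spark.table("fedora.datanommer_messages_bronze")

# users -> join table on the user id, then join table -> messages on BOTH
# msg_id and timestamp, mirroring the Datagrepper join condition noted above.
user_messages = (
    users.join(links, users.id == links.user_id)
         .join(
             messages,
             (links.msg_id == messages.msg_id)
             & (links.msg_timestamp == messages.timestamp),
         )
         .select(users.name, messages.topic, messages.timestamp)
)

user_messages.show(10, truncate=False)
```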

Silver (planned) #

See the explanation in the silver roadmap. Bolded elements are changes from bronze.

fedora.datanommer_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | long | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| topic | string | True | |
| timestamp | timestamptz | True | |
| category | string | True | |
| agent_name | string | False | |
| source_name | string | True | |
| source_version | string | True | |
| msg | **variant** | True | |
| headers | **variant** | False | |

fedora.datanommer_users_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| user_id | integer | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| msg_timestamp | timestamptz | True | |

fedora.datanommer_packages_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| package_id | integer | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| msg_timestamp | timestamptz | True | |

Pulling from upstream #

Fedora Infrastructure makes its database backups available at https://infrastructure.fedoraproject.org/infra/db-dumps/. Instructions for downloading a Datanommer snapshot and running it locally are available at https://codeberg.org/fedora-mwinters/datanommer-restore.

PII #

There has not been a comprehensive review of the data, but others and I have spent some time spot-checking topic data and have not found any PII beyond usernames, which we don’t consider to be PII. (There is no need to use your real name as a username.) That said, it would be good for someone to verify the entire dataset. There are currently more than 28,000 topics in the database.

Of note, this data has been publicly available for many years. I believe any glaring issues would have been discovered by now, though there is always the possibility that we discover more once the data is made more accessible. Should we find any, we can simply delete all messages from the offending topic from the data lake. However, whether upstream Datagrepper / Datanommer would need corresponding changes is less clear.
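
For instance, assuming the lake tables are Iceberg tables writable from Spark, removing an offending topic could be a single row-level delete (the topic name here is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg supports row-level DELETE through Spark SQL. The same delete would
# need to be repeated for the join tables via the affected msg_id values.
spark.sql("""
    DELETE FROM fedora.datanommer_messages_bronze
    WHERE topic = 'org.example.sensitive.topic'
""")
```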

Quality Roadmap #

Refinement challenges #

Even though each topic is theoretically supposed to have a consistent JSON schema, in reality schemas do change, including for reasons outside of our control. Discovering the full schema for a given topic requires reading every message ever written to that topic, and even then there is no guarantee of future stability.

My original “silver” plan was to perform this sort of schema inference for core topics, extract those messages into a searchable schema, and provide a mechanism to detect and reflect future schema changes. However, a better option has emerged.

Silver #

Proper UUID #

msg_id is effectively a UUID, except for early messages in Datanommer history, which have the year prepended to the UUID for reasons lost to time. For silver we will convert this to a proper UUID by stripping out the prefix, including in the join tables.
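
A minimal sketch of that normalization, assuming the legacy prefix is a four-digit year followed by a dash (the exact format should be verified against real rows):

```python
import re
import uuid

# Assumed legacy format: a four-digit year glued onto the front of an
# otherwise valid UUID, e.g. "2010-xxxxxxxx-xxxx-...". Verify against the data.
YEAR_PREFIX = re.compile(r"^\d{4}-(?=[0-9a-fA-F]{8}-)")

def normalize_msg_id(raw: str) -> uuid.UUID:
    """Strip the legacy year prefix, if present, and parse the result as a UUID."""
    return uuid.UUID(YEAR_PREFIX.sub("", raw))

assert normalize_msg_id("2010-12345678-1234-1234-1234-123456789abc") == \
    uuid.UUID("12345678-1234-1234-1234-123456789abc")
```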

VARIANT for JSON data #

Iceberg v3 has support for Spark’s VARIANT type. See also: Iceberg / Parquet docs.

Quoting from the Parquet docs:

> The Variant Binary Encoding allows representation of semi-structured data (e.g. JSON) in a form that can be efficiently queried by path. The design is intended to allow efficient access to nested data even in the presence of very wide or deep structures.

This should provide a reasonable level of performance for querying JSON data without requiring us to perform complex schema management. It is a new type though, and support across query engines is still emerging, so we may need to wait a bit.
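
As an illustration of what this unlocks, here is a hypothetical Spark 4 query against the planned silver table using the VARIANT path functions (the `$.owner` path is a placeholder, and engine support should be checked first):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, variant_get

spark = SparkSession.builder.getOrCreate()

messages = spark.table("fedora.datanommer_messages_silver")

# With msg stored as VARIANT, fields are addressed by path and cast on read,
# with no per-topic schema management. "$.owner" is a placeholder path.
owners = messages.select(
    col("msg_id"),
    variant_get(col("msg"), "$.owner", "string").alias("owner"),
)

owners.show(10, truncate=False)
```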

Coverage Roadmap #

Backfill #

After we have validated the fundamental assumptions of Hatlas and our ability to continue hosting it (hopefully within a few days), it will be fairly trivial to manually backfill this dataset.

How much of that data can live in this POC remains TBD, depending frankly on hosting expenses and my ability to personally fund them.

Automation #

Automation tooling for future coverage is still TODO. The plan for Datanommer is:

  1. Modify the upstream processes to include nightly pg_dumps of the previous 24h, with retention of 14 days (a rough sketch follows this list).
    • This allows those who are running a Datanommer copy on non-Fedora infra (such as myself) to stay current without requiring a full snapshot reload.
  2. Get automation tooling into place such as Airflow or Dagster.
  3. Automate a nightly batch ingest of the previous 24h.
  4. Re-assess.
    • We would prefer to perform streaming ingest, but we need to avoid increasing the load on Datanommer’s DB. There are a number of possible avenues to pursue, one of which may be to reduce the current Datanommer load by truncating its history, after we’ve determined that it’s sufficiently represented here.
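
A rough sketch of step 1’s nightly job, assuming a psql \copy with a timestamp filter rather than a full pg_dump (the database name, table name, and paths are placeholders):

```python
import datetime
import pathlib
import subprocess

DUMP_DIR = pathlib.Path("/srv/datanommer-dumps")  # placeholder location
RETENTION_DAYS = 14

def nightly_dump() -> None:
    """Export the previous 24h of messages and prune dumps past retention."""
    now = datetime.datetime.now(datetime.timezone.utc)
    since = now - datetime.timedelta(hours=24)
    out = DUMP_DIR / f"messages-{now:%Y%m%d}.csv"

    # pg_dump cannot filter rows, so export the 24h window with psql's \copy.
    query = (
        "\\copy (SELECT * FROM messages "
        f"WHERE timestamp >= '{since.isoformat()}') TO '{out}' CSV HEADER"
    )
    subprocess.run(["psql", "datanommer", "-c", query], check=True)  # placeholder DB name

    # Enforce the 14-day retention window on previously written dumps.
    cutoff = now - datetime.timedelta(days=RETENTION_DAYS)
    for dump in DUMP_DIR.glob("messages-*.csv"):
        mtime = datetime.datetime.fromtimestamp(
            dump.stat().st_mtime, datetime.timezone.utc
        )
        if mtime < cutoff:
            dump.unlink()
```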