HATLAS - a Fedora data project

Datanommer

About #

Datanommer is historical data from “Fedora Messaging”, the message bus (based on RabbitMQ) that powers our event-driven apps. It contains JSON messages sent by many internal and external systems, with the intended recipients being other apps in our ecosystem. These messages represent both user activity and automated system activity.


Status #

Last updated: 2025-10-08

For the Hatlas launch I only want to host one month of data, which is roughly 1.5 GB. I plan to backfill the rest in the near future, after validating infra costs and getting community approval. Watch this page or the news.

Schema #

Bronze #

messages hasMany users
messages hasMany packages

fedora.datanommer_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | long | True | |
| msg_id | string | True | Supposed to be a UUID, but messages early in Datanommer history are not conformant. |
| topic | string | True | |
| timestamp | timestamptz | True | |
| category | string | True | Used to facilitate Datagrepper queries. |
| agent_name | string | False | The name of the user; sometimes this is empty and the name exists only in msg. |
| source_name | string | True | System metadata. IMO this column should be dropped. |
| source_version | string | True | System metadata. IMO this column should be dropped. |
| msg | string | True | Topic-specific JSON. |
| headers | string | False | Topic-specific JSON. |
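
For illustration, a minimal PySpark sketch of querying this table, assuming it is exposed through a Spark catalog under the name above; the category value and the `$.owner` JSON path are placeholders, not guaranteed fields:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.getOrCreate()

messages = spark.table("fedora.datanommer_messages_bronze")

# Filter on category (which exists to facilitate Datagrepper-style queries)
# and a time window, then pull a field out of the raw JSON msg string by path.
recent = (
    messages
    .filter(col("category") == "copr")          # placeholder category
    .filter(col("timestamp") >= "2025-09-01")
    .withColumn("owner", get_json_object(col("msg"), "$.owner"))  # placeholder path
    .select("msg_id", "topic", "timestamp", "owner")
)

recent.show(10, truncate=False)
```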

fedora.datanommer_users_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | integer | True | |
| name | string | True | Username. |

fedora.datanommer_packages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | integer | True | |
| name | string | True | |

Join Tables #

fedora.datanommer_users_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| user_id | integer | True | |
| msg_id | string | True | |
| msg_timestamp | timestamptz | True | |

Note: Datagrepper performs this join on both msg_id and msg_timestamp.

fedora.datanommer_packages_messages_bronze #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| package_id | integer | True | |
| msg_id | string | True | |
| msg_timestamp | timestamptz | True | |

Note: Datagrepper performs this join on both msg_id and msg_timestamp.
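
As a sketch of how the hasMany relationships resolve in practice, here is a hypothetical PySpark join from users to their messages through the join table, matching on both msg_id and msg_timestamp as Datagrepper does:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

users = spark.table("fedora.datanommer_users_bronze")
links = spark.table("fedora.datanommer_users_messages_bronze")
messages = spark.table("fedora.datanommer_messages_bronze")

# users -> join table on the user id, then join table -> messages on BOTH
# msg_id and timestamp, mirroring the Datagrepper join condition noted above.
user_messages = (
    users.join(links, users.id == links.user_id)
         .join(
             messages,
             (links.msg_id == messages.msg_id)
             & (links.msg_timestamp == messages.timestamp),
         )
         .select(users.name, messages.topic, messages.timestamp)
)

user_messages.show(10, truncate=False)
```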

Silver (planned) #

See the explanation in the silver roadmap. Bolded elements are changes from bronze.

fedora.datanommer_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| id | long | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| topic | string | True | |
| timestamp | timestamptz | True | |
| category | string | True | |
| agent_name | string | False | |
| source_name | string | True | |
| source_version | string | True | |
| msg | **variant** | True | |
| headers | **variant** | False | |

fedora.datanommer_users_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| user_id | integer | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| msg_timestamp | timestamptz | True | |

fedora.datanommer_packages_messages_silver #

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| package_id | integer | True | |
| msg_id | **uuid** | True | Rows with prefixed UUIDs will be stripped of their prefix. |
| msg_timestamp | timestamptz | True | |

Pulling from upstream #

Fedora Infrastructure makes its database backups available at https://infrastructure.fedoraproject.org/infra/db-dumps/. Instructions for downloading a Datanommer snapshot and running it locally are available at https://codeberg.org/fedora-mwinters/datanommer-restore.

PII #

There has not been a comprehensive review of the data, but others and I have spent some time spot-checking topic data and have not found any PII beyond usernames, which we don’t consider to be PII. (There is no need to use your real name as a username.) That said, it would be good for someone to verify the entire dataset. There are currently more than 28,000 topics in the database.

Of note, this data has been publicly available for many years. I believe any glaring issues would have been discovered by now, though there is always the possibility that we discover more once the data is made more accessible. Should we find any, we can simply delete all messages from the offending topic from the data lake. However, whether upstream Datagrepper / Datanommer would need corresponding changes is less clear.
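
For instance, assuming the lake tables are Iceberg tables writable from Spark, removing an offending topic could be a single row-level delete (the topic name here is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg supports row-level DELETE through Spark SQL. The same delete would
# need to be repeated for the join tables via the affected msg_id values.
spark.sql("""
    DELETE FROM fedora.datanommer_messages_bronze
    WHERE topic = 'org.example.sensitive.topic'
""")
```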

Quality Roadmap #

Refinement challenges #

Even though each topic is theoretically supposed to have a consistent JSON schema, in reality schemas do change, including for reasons outside of our control. Discovering the full schema for a given topic requires reading every message ever written to that topic, and even then there is no guarantee of future stability.

My original “silver” plan was to perform this sort of schema inference for core topics, extract those messages into a searchable schema, and provide a mechanism to detect and reflect future schema changes. However, a better option has emerged.

Silver #

Proper UUID #

msg_id is effectively a UUID, except for early messages in Datanommer history, which have the year prepended to the UUID for reasons lost to time. For silver we will convert this to a proper UUID by stripping out the prefix, including in the join tables.
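
A minimal sketch of that normalization, assuming the legacy prefix is a four-digit year followed by a dash (the exact format should be verified against real rows):

```python
import re
import uuid

# Assumed legacy format: a four-digit year glued onto the front of an
# otherwise valid UUID, e.g. "2010-xxxxxxxx-xxxx-...". Verify against the data.
YEAR_PREFIX = re.compile(r"^\d{4}-(?=[0-9a-fA-F]{8}-)")

def normalize_msg_id(raw: str) -> uuid.UUID:
    """Strip the legacy year prefix, if present, and parse the result as a UUID."""
    return uuid.UUID(YEAR_PREFIX.sub("", raw))

assert normalize_msg_id("2010-12345678-1234-1234-1234-123456789abc") == \
    uuid.UUID("12345678-1234-1234-1234-123456789abc")
```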

VARIANT for JSON data #

Iceberg v3 has support for Spark’s VARIANT type. See also: Iceberg / Parquet docs.

Quoting from the Parquet docs:

> The Variant Binary Encoding allows representation of semi-structured data (e.g. JSON) in a form that can be efficiently queried by path. The design is intended to allow efficient access to nested data even in the presence of very wide or deep structures.

This should provide a reasonable level of performance for querying JSON data without requiring us to perform complex schema management. It is a new type though, and support across query engines is still emerging, so we may need to wait a bit.
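
As an illustration of what this unlocks, here is a hypothetical Spark 4 query against the planned silver table using the VARIANT path functions (the `$.owner` path is a placeholder, and engine support should be checked first):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, variant_get

spark = SparkSession.builder.getOrCreate()

messages = spark.table("fedora.datanommer_messages_silver")

# With msg stored as VARIANT, fields are addressed by path and cast on read,
# with no per-topic schema management. "$.owner" is a placeholder path.
owners = messages.select(
    col("msg_id"),
    variant_get(col("msg"), "$.owner", "string").alias("owner"),
)

owners.show(10, truncate=False)
```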

Coverage Roadmap #

Backfill #

After we have validated the fundamental assumptions of Hatlas and our ability to continue hosting it (hopefully within a few days), it will be fairly trivial to manually backfill this dataset.

How much of that data can live in this POC remains TBD, depending frankly on hosting expenses and my ability to personally fund them.

Automation #

Automation tooling for future coverage is still TODO. The plan for Datanommer is:

  1. Modify the upstream processes to include nightly pg_dumps of the previous 24h, with retention of 14 days (a rough sketch follows this list).
    • This allows those who are running a Datanommer copy on non-Fedora infra (such as myself) to stay current without requiring a full snapshot reload.
  2. Get automation tooling into place such as Airflow or Dagster.
  3. Automate a nightly batch ingest of the previous 24h.
  4. Re-assess.
    • We would prefer to perform streaming ingest, but we need to avoid increasing the load on Datanommer’s DB. There are a number of possible avenues to pursue, one of which may be to reduce the current Datanommer load by truncating its history, after we’ve determined that it’s sufficiently represented here.
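
A rough sketch of step 1’s nightly job, assuming a psql \copy with a timestamp filter rather than a full pg_dump (the database name, table name, and paths are placeholders):

```python
import datetime
import pathlib
import subprocess

DUMP_DIR = pathlib.Path("/srv/datanommer-dumps")  # placeholder location
RETENTION_DAYS = 14

def nightly_dump() -> None:
    """Export the previous 24h of messages and prune dumps past retention."""
    now = datetime.datetime.now(datetime.timezone.utc)
    since = now - datetime.timedelta(hours=24)
    out = DUMP_DIR / f"messages-{now:%Y%m%d}.csv"

    # pg_dump cannot filter rows, so export the 24h window with psql's \copy.
    query = (
        "\\copy (SELECT * FROM messages "
        f"WHERE timestamp >= '{since.isoformat()}') TO '{out}' CSV HEADER"
    )
    subprocess.run(["psql", "datanommer", "-c", query], check=True)  # placeholder DB name

    # Enforce the 14-day retention window on previously written dumps.
    cutoff = now - datetime.timedelta(days=RETENTION_DAYS)
    for dump in DUMP_DIR.glob("messages-*.csv"):
        mtime = datetime.datetime.fromtimestamp(
            dump.stat().st_mtime, datetime.timezone.utc
        )
        if mtime < cutoff:
            dump.unlink()
```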