Docs: Contributing
Join us! #
There is an incredible amount of work ahead, in all areas of engineering!
The best way to get involved is to join the Fedora Matrix channel for our Data
Working Group: #data:fedoraproject.org. If you’ve never used Matrix before,
this link will
walk you through creating a Fedora Accounts login and joining us there.
You’d also be welcome to join us on Fedora
Discussions (Discourse).
Just be sure to tag your post with #commops, since the “Community Operations”
group is the primary driver of community health analytics.
Note: Hatlas is not an official Fedora project (yet!), but many interested parties are coordinating efforts in Matrix.
5-second overview #
Last updated: 2025-11-09
Target Architecture: #
Current Architecture: #
Contribution areas #
Infrastructure #
It all starts with Infrastructure! Here is a set of high-level TODOs in approximate delivery order.
TODO: Observability #
Our current deployment has very little observability. This is my top priority, but my first pass will likely be minimalistic and focused on just keeping the thing alive, which will leave plenty of opportunity for improvement.
Like all other things Hatlas, it would be neat to make these public too.
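As a starting point, even a dumb external liveness probe would be better than nothing. Here's a minimal sketch using only the standard library; the `/healthz` path and port are assumptions for illustration, not a statement about what our deployment actually exposes:

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Probe a single HTTP endpoint and report (healthy?, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False, str(exc)


if __name__ == "__main__":
    # "/healthz" on port 8181 is a placeholder; adjust for the real deployment.
    ok, detail = check_health("http://localhost:8181/healthz")
    print("healthy" if ok else f"unhealthy: {detail}")
```

Cron this with an alert on failure and we at least know when the thing falls over; real observability (metrics, logs, dashboards) comes later.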
TODO: Deploy Polaris to Fedora Infra #
This is a top priority, but it’s also a huge amount of work! I plan to start this ASAP but it will take some time before it’s ready. We may also need to defer this until after Hatlas has helped us solidify our data formats / understand our infra requirements (e.g. expected load).
Polaris is currently running in a container on my personal VPS and backed by my Cloudflare R2 storage account.
See also: Libera, Stripe, Patreon – I’m unemployed at the moment and will shamelessly take all the help I can get. My wife thanks you for supporting open data!
Challenges #
Moving the container should be fairly easy, but:
- We will need to request and allocate significant resources within Fedora
  Infra, including disk space, RAM, and bandwidth. Current values are unknown,
  so we can’t even make the request yet.
- A part of the reason Hatlas exists is to help us gather this info.
- These resources are not yet approved and therefore there is no guarantee of availability or timing.
- We need to deploy to OpenShift, which is a flavor of Kubernetes that is
  generally locked behind Red Hat gates.
  - Sorry, Red Hat, you need to do better here. I can’t deploy what I’m not allowed to touch or even read the docs on.
- Within Fedora Infra we will likely need some S3-compatible storage.
- OpenShift does apparently provide this as a feature, but I have no experience with it. See above.
- We will almost certainly need to complete the OAuth POC before we can call this done.
TODO: More QuickStarts #
I’d like to have QuickStarts for at least:
- Spark 4
- Trino
- Spark 3.5
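Most of a QuickStart is really just catalog configuration: Polaris speaks the Iceberg REST catalog protocol, so a Spark QuickStart boils down to a handful of `spark.sql.catalog.*` settings. As a sketch, where the catalog name, endpoint URL, and warehouse are placeholders:

```python
def polaris_spark_conf(
    catalog: str = "hatlas",  # placeholder catalog name
    uri: str = "https://polaris.example.org/api/catalog",  # placeholder endpoint
    warehouse: str = "hatlas",  # placeholder warehouse
) -> dict[str, str]:
    """Spark SQL settings for an Iceberg REST catalog (the protocol Polaris speaks)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enable Iceberg's SQL extensions (MERGE INTO, procedures, etc.).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register the catalog and point it at the REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }
```

Feeding these into `SparkSession.builder.config(key, value)`, with the matching Iceberg Spark runtime jar on the classpath, should be enough to start running `SELECT`s against the catalog. (Authentication settings are omitted here; see the OAuth TODO below.)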
TODO: OAuth #
Fedora Accounts Service (“FAS”) is Fedora’s OAuth / OIDC provider.
- Can we integrate with this directly as a POC running outside of Fedora Infra? Presumably, no. At least, not in FAS prod.
- If not, let’s stand up our own identity provider and integrate Polaris access
with it.
- This would ease my mind in several areas, including identifying bad actors and requiring agreement to the data usage guidelines before granting access.
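Whichever identity provider we land on, machine clients will most likely authenticate with the standard OAuth2 client-credentials flow (RFC 6749 §4.4). A hedged sketch of building the token request; the endpoint URL, client name, and scope are placeholders, and real code would also need to parse the JSON response and handle token expiry:

```python
import urllib.request
from urllib.parse import urlencode


def token_request(
    token_url: str,
    client_id: str,
    client_secret: str,
    scope: str = "catalog",  # placeholder scope
) -> urllib.request.Request:
    """Build an OAuth2 client-credentials token request."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode()
    # Presence of a body makes this a POST, as the token endpoint requires.
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```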
TODO: Data Orchestration Engine #
All data engineering is currently triggered on an ad-hoc basis, using scripts that are still quite minimalistic. Ideally, we would deploy something like Apache Airflow or Dagster to automate these jobs.
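Whatever engine we choose, the jobs form a dependency DAG. Here's a toy standard-library sketch with hypothetical task names, just to illustrate the shape of the problem an orchestrator would solve for us:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: raw ingest -> bronze -> silver, per dataset.
# Each task maps to the set of tasks it depends on.
PIPELINE: dict[str, set[str]] = {
    "ingest_datanommer": set(),
    "datanommer_bronze": {"ingest_datanommer"},
    "datanommer_silver": {"datanommer_bronze"},
    "ingest_countme": set(),
    "countme_bronze": {"ingest_countme"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order respecting all dependencies."""
    return list(TopologicalSorter(dag).static_order())
```

A real orchestrator adds the parts that actually matter: scheduling, retries, backfills, and visibility into failures.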
TODO: Public Trino #
I’m uncertain whether this is feasible, but it would be nice to POC a fully hosted public query interface to further lower the barriers to entry. This would presumably require OAuth even for the POC launch, to avoid abuse.
Programming #
TODO: Integrate with Fedora’s “Personal Data Removal” process #
Fedora has a “personal data removal” process (“PDR”) for compliance with GDPR. However:
- I have not yet had time to look at the internals of how this works.
- We need to wire the lakehouse up to the PDR process so that deletes automatically propagate here.
Data Engineering & Architecture #
TODO: Datanommer Quality: Silver #
Datanommer is our most important dataset. The dataset page outlines the general status and steps ahead. There may be some overlaps with infra, since it’s uncertain whether our current software stack sufficiently supports the features we need.
TODO: Countme: Bronze #
Countme is “low-hanging fruit” for a bronze conversion. (Read: it should be pretty easy! And I’m saving it for you!)
This dataset gives us statistics on how prevalent each Fedora and CentOS release is in the wild.
We already have this data available in other formats, but bringing it into Hatlas would allow us to unify where we are performing our analytics.
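As a sketch of what a bronze conversion mostly involves (parsing and type-casting raw exports), here's a toy normalizer. The column names here are assumptions for illustration; check the real countme export before relying on them:

```python
import csv
import io


def to_bronze(raw_csv: str) -> list[dict]:
    """Parse a raw countme CSV export into typed bronze records.

    The columns used here (week_start, os_name, hits) are assumptions,
    not the confirmed schema of the real export.
    """
    records = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        records.append({
            "week_start": row["week_start"],
            "os_name": row["os_name"],
            # Cast counts to integers at the bronze layer so downstream
            # aggregations don't have to re-parse strings.
            "hits": int(row["hits"]),
        })
    return records
```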
Data Analysis #
Yes!

