Overdue updates
I’ve not updated this space in quite some time, as I am still discovering new opportunities and obstacles within the problem space. Plus, my priorities tend to shift based on whoever is around and offering to help, which is noise I try to avoid propagating.
However, I do generally try to produce weekly(-ish) status updates in the Matrix channel, and I’ve decided to start providing something similar here since I do see some hits against the RSS feed.
Personal note: I’m still unemployed #
Sorry for the interruption, but if you’re aware of any Sr. / Staff SRE or SWE roles, especially those with a focus on open data and/or open source, please reach out.
Alternatively, if you might be able to contribute to any of the hosting costs for Hatlas (roughly $100/mo), I have several options available.
Major changes since November #
Pause on the cutting edge of Iceberg #
Lakehouse architecture, and specifically Iceberg, is what drew me to launch this project in the first place. This data model is far more flexible than the alternatives while also being at least one order of magnitude cheaper – likely several. The cost savings are especially what drew me in this direction, because they put medium- to large-scale analytics within the budget of a hobbyist or cost-conscious non-profit.
I still believe in this direction, but I’ve had to pause these efforts temporarily because Apache Polaris (which is not the only Iceberg REST solution, but certainly the most prominent and the most promising) only fully supported AWS S3, not any other “S3 compatible” storage. AWS S3 is terribly expensive compared to basically all other options, so non-AWS support was and continues to be critical to our story.
I launched the Hatlas lakehouse shortly after support for non-AWS S3 buckets landed on the Polaris main branch. This actually worked just fine with pyiceberg, which is what our POC Jupyter notebook used. However, I had perhaps gotten ahead of myself because I quickly realized that no other client / query engine knew how to handle this configuration. This meant that I had exactly one tool with which to manage the Lakehouse, including tasks such as data maintenance, which pyiceberg wasn’t really built for.
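For reference, connecting pyiceberg to a REST catalog that fronts S3-compatible storage looks roughly like the sketch below. The URI, warehouse name, and credentials here are placeholders, not the actual Hatlas configuration.

```python
# Rough sketch of a pyiceberg REST-catalog connection to S3-compatible storage.
# The URI, warehouse, endpoint, and credentials are placeholders, not the
# actual Hatlas configuration.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "hatlas",
    **{
        "type": "rest",
        "uri": "https://polaris.example.org/api/catalog",
        "credential": "client_id:client_secret",  # Polaris OAuth2 client credentials
        "warehouse": "hatlas",
        # These are the properties that make non-AWS, "S3 compatible" storage work:
        "s3.endpoint": "https://s3.example.org",
        "s3.access-key-id": "example-access-key",
        "s3.secret-access-key": "example-secret-key",
        "s3.region": "us-east-1",
    },
)

print(catalog.list_namespaces())
```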
We now had the inverse of all of our design goals: we had a single working client, and our costs would quickly rise without a process to clean out stale data. Meanwhile, we had experienced analysts who were waiting in the wings and eager to get moving, and they couldn’t connect using their tools of choice.
I made the difficult (for me) decision to temporarily pull the plug on the Iceberg approach and adopt an interim solution which is far more traditional and will deliver more immediate value.
Postgres is dead, long live Postgres #
We had two experienced analysts who were basically sitting idle, both of whom knew Postgres well, and our source data is in Postgres anyway, albeit very poorly-optimized. I decided to accept the idea of doing traditional “everything in Postgres” analytics for now.
I granted trusted members of FDWG direct access to the Hatlas Datanommer replica, but first we needed some optimizations to make this work feasible. Upstream stores JSON blobs as text, which meant that every query had to JSON-parse every single row, which took basically forever. Additionally, we have a “topic” column which fits very naturally with Postgres' ltree type, and almost every query includes a topic WHERE clause.
It took me a few weeks to figure out the best approach for this because A) I’m not a Postgres expert, and B) nothing with this dataset is easy. For example, some of our JSON blobs are larger than the hard-coded maximum size that Postgres can parse before it gives up. We also have blobs that contain JSON nulls, which Postgres blows up on. You have to handle these errors and others in several different ways in order to have a process that can work through the whole dataset, plus I wanted to ensure that our approach would continue to work as we ingest new data.
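To give a flavor of what “handle these errors” means in practice, here is a simplified sketch, not the actual migration code, with made-up table and column names. The trick is a small helper that returns NULL instead of raising when a blob can’t be parsed, plus batched updates so one bad row or one crash doesn’t throw away hours of work:

```python
# Simplified sketch of the error-tolerant conversion; not the real migration
# code. Table/column names (messages, msg_text, msg, topic, topic_path) are
# made up, and an integer "id" primary key is assumed for batching.
import psycopg

SETUP = """
-- Cast text to jsonb, returning NULL instead of erroring on blobs that
-- Postgres can't parse (oversized documents, embedded nulls, etc.).
CREATE OR REPLACE FUNCTION try_jsonb(raw text) RETURNS jsonb AS $$
BEGIN
    RETURN raw::jsonb;
EXCEPTION WHEN OTHERS THEN
    RETURN NULL;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE EXTENSION IF NOT EXISTS ltree;
ALTER TABLE messages ADD COLUMN IF NOT EXISTS msg jsonb;
ALTER TABLE messages ADD COLUMN IF NOT EXISTS topic_path ltree;
"""

CONVERT = """
UPDATE messages
SET msg        = try_jsonb(msg_text),
    topic_path = topic::ltree   -- assumes topics are already valid ltree labels
WHERE id BETWEEN %(lo)s AND %(hi)s;
"""

with psycopg.connect("dbname=datanommer") as conn:
    conn.execute(SETUP)
    max_id = conn.execute("SELECT max(id) FROM messages").fetchone()[0]
    batch = 100_000
    for lo in range(0, max_id + 1, batch):
        conn.execute(CONVERT, {"lo": lo, "hi": lo + batch - 1})
        conn.commit()  # commit per batch so a failure doesn't roll back everything
```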
Once I had a process, I had to run it across the entire dataset. And this is where you become a Postgres expert, because getting this to run well required knowledge of most of the Postgres tunables plus btrfs optimizations. I was figuring this out as I went, so overall the full conversion for both data types took an entire calendar month! This is because the database is large enough that it doesn’t fit on any NVMe or SSD storage available to me, and you need to really understand Postgres internals in order to optimize for HDD head-seek times. I think that if I were to run this again it would take significantly less time with the tuning applied, and it would probably only take hours if I had a large NVMe volume. I guess I can say I’m a DBA now, though!
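For the curious, the session-level knobs fall into a familiar bucket for bulk rewrites. The settings below are illustrative for a one-off conversion on a rebuildable replica, not the exact values I landed on, and not something to leave enabled in normal operation:

```python
# Illustrative session settings for a one-off bulk rewrite on a rebuildable
# replica; values are placeholders, not the exact tuning I ended up with.
import psycopg

BULK_SESSION = """
SET maintenance_work_mem = '2GB';
SET work_mem = '256MB';
SET synchronous_commit = off;              -- acceptable: the replica can be rebuilt
SET max_parallel_workers_per_gather = 0;   -- avoid competing random I/O on HDDs
"""

with psycopg.connect("dbname=datanommer") as conn:
    conn.execute(BULK_SESSION)
    # ... run the batched UPDATEs from the previous sketch in this session ...
```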
The optimizations turned out to be worth it: queries against json fields are now 36x faster, and queries against topic are now 2x faster! We finally have a database that is feasible for analytics! (Even though it’s far from optimal.)
This puts us into position to start considering how we want to transform this data, and we can start building these transformation pipelines with tools that will translate directly to Iceberg later, e.g. SQLMesh.
Unfortunately, this work was completed just in time for both of our “rockstar” analysts to pull back from their Fedora contributions :(. One had a career change and the other had a major life change, all of which is totally understandable. Hopefully we’ll see their return as they settle into their new normal, and hopefully we’ll continue to see new curious onlookers joining our Matrix channel.
WIP: Shared foundations #
I continue to focus on building out the foundations that we need for a multi-tenant and open-infra analytics environment:
On the data side #
- I’m working upstream with Fedora Infra on several critical aspects of this work. For example:
  - Upstream Datanommer (our main dataset) is now producing nightly exports so that our replica can stay in sync.
  - I’m also working to get ssh git access enabled on Fedora Forge, which is Fedora’s new Forgejo-based home, as that is critical to my workflow and it is the only thing preventing me from moving the Hatlas code repos out of my personal Codeberg.
- We finally have a simple and reliable path to produce a canonical dataset in Parquet (via pg_parquet). This is available for FDWG members.
  - Some test queries against a local dataset with DuckDB are astoundingly fast – sometimes 1000x faster than Postgres. To me, this further validates the Lakehouse path.
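  As a rough illustration only (the paths and column names below are placeholders, not the real export layout), querying the Parquet export with DuckDB looks something like this:

  ```python
  # Placeholder paths/columns; shows the shape of the DuckDB-over-Parquet queries.
  # The export itself is produced by pg_parquet on the Postgres side, roughly:
  #   COPY (SELECT ...) TO '/exports/datanommer.parquet' WITH (format 'parquet');
  import duckdb

  con = duckdb.connect()
  top_topics = con.sql("""
      SELECT topic, count(*) AS n
      FROM read_parquet('datanommer-export/*.parquet')
      WHERE sent_at >= DATE '2024-01-01'
      GROUP BY topic
      ORDER BY n DESC
      LIMIT 20
  """).df()
  print(top_topics)
  ```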
- I’ve designed and implemented a PII-protection scheme (currently Postgres-only but applicable elsewhere), whereby our transformations from “barely-queryable data” into “ideal analytical data” also separate out identifiers (PII) into privileged tables (a rough sketch of the idea is below).
  - This means that as our transformation pipelines improve, we will be able to implement a very low bar for access to analytics work while retaining a high minimum level of trust for access to PII.
  - More details in a future update.
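  Purely to illustrate the shape of the idea – this is not the actual Hatlas schema, and the table, role, and column names below are made up:

  ```python
  # Illustrative only, not the real schema: analytical tables carry an opaque
  # actor_id, while the mapping back to usernames/emails lives in a separate
  # table that only a privileged role may read.
  import psycopg

  SPLIT = """
  CREATE TABLE IF NOT EXISTS pii.actors (
      actor_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      username  text UNIQUE,
      email     text
  );
  REVOKE ALL ON pii.actors FROM PUBLIC;
  GRANT SELECT ON pii.actors TO hatlas_pii_readers;     -- high-trust role

  CREATE TABLE IF NOT EXISTS analytics.events (
      actor_id  bigint REFERENCES pii.actors (actor_id),
      topic     text,
      sent_at   timestamptz
  );
  GRANT SELECT ON analytics.events TO hatlas_analysts;  -- low-bar role
  """

  with psycopg.connect("dbname=hatlas") as conn:
      conn.execute(SPLIT)
  ```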
- I’ve started creating a data dictionary to capture all of my / our accumulated knowledge about Datanommer’s data.
  - These facts exist elsewhere, but they are spread across many different repos and technologies, and even if we could somehow ingest them all we would still have an incomplete understanding of our data.
  - Building this dictionary is laborious work, and slightly fragile since updates elsewhere will need to be manually propagated to the dictionary. The upshot is that we can build tools that fully understand our dataset, which is needed for:
- I’ve started building some prototype data transformation pipelines in SQLMesh.
  - These accomplish the transformations I think we need, although the current pipelines were built by hand as a POC.
  - I’m now working on a data-driven SQLMesh pipeline generator, which will (re)build all pipelines from the data dictionary. It should then be trivial to orchestrate these pipelines based on the same dictionary.
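  To make the generator idea concrete – the dictionary format, field names, and model contents below are illustrative placeholders, not the final design:

  ```python
  # Sketch of "dictionary entry in, SQLMesh model file out".
  # Everything here (entry format, expressions, model kind) is a placeholder.
  from pathlib import Path

  DICTIONARY = [
      {
          "topic": "org.fedoraproject.prod.bodhi.update.request.testing",
          "model": "staging.bodhi_update_requests",
          "fields": {"update_alias": "msg #>> '{update,alias}'",
                     "agent": "msg #>> '{agent_name}'"},
      },
  ]

  TEMPLATE = """MODEL (
    name {model},
    kind INCREMENTAL_BY_TIME_RANGE (time_column sent_at)
  );

  SELECT
    sent_at,
    {columns}
  FROM raw.messages
  WHERE topic = '{topic}'
    AND sent_at BETWEEN @start_ts AND @end_ts;
  """

  for entry in DICTIONARY:
      columns = ",\n    ".join(f"{expr} AS {name}" for name, expr in entry["fields"].items())
      Path("models").mkdir(exist_ok=True)
      path = Path("models") / (entry["model"].replace(".", "_") + ".sql")
      path.write_text(TEMPLATE.format(model=entry["model"], columns=columns, topic=entry["topic"]))
  ```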
- The data dictionary includes a rudimentary taxonomy of user actions.
  - This is nothing short of an attempt to classify all human activity on Fedora, in part to ensure that all such activities are actually being captured.
  - I’m hoping to eventually run my version past the academic types in this space, such as those within CHAOSS, since any taxonomy will tend to be highly subjective and localized to one’s personal view of the project.
  - I’m foolishly hopeful that the collaboration necessary to get this formalized will also act as a form of knowledge capture, helping people (including myself!) see the Fedora project more holistically.
- We have many different use cases for analyzing HTTP traffic: the docs team wants to know which docs are most / least popular, infra wants to know which ASNs are driving our DDoS traffic, etc.
  - I made a POC that ingests Apache logs into ClickHouse, enriches them with geospatial data while also pseudonymizing them, and then analyzes the results with Grafana.
  - The results were promising, but the datasets here are large enough that hosting hot data on Hatlas would potentially conflict with our other pursuits. I’ve dropped this for now, but I plan to return if/when we get our Lakehouse back on track.
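  A sketch of the pseudonymization half of that idea – this is not the actual POC code, and the log format, field names, and salt handling are all placeholders:

  ```python
  # Sketch of log pseudonymization, not the actual POC code.
  # Replace each client IP with a stable, salted, non-reversible token so we can
  # still count distinct clients without storing the IPs themselves.
  import hashlib
  import os
  import re

  SALT = os.environ["LOG_PSEUDONYM_SALT"]  # rotating the salt "forgets" old clients
  LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d+)')

  def pseudonymize(ip: str) -> str:
      return hashlib.sha256((SALT + ip).encode()).hexdigest()[:16]

  def parse(line: str) -> dict | None:
      m = LOG_RE.match(line)
      if not m:
          return None
      return {
          "client": pseudonymize(m["ip"]),
          "ts": m["ts"],
          "request": m["req"],
          "status": int(m["status"]),
      }

  # Parsed rows would then be enriched (ASN/geo lookups happen *before* the raw
  # IP is discarded) and bulk-inserted into ClickHouse, e.g. via clickhouse-connect.
  ```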
On the infra side #
- I’ve continued to flesh out our bare-metal Kubernetes deployment with an eye towards multi-tenant access.
  - Much of the tech debt from “this is my throwaway dev environment” has been cleaned up, and we now have basic security measures in place such as NetworkPolicies and SSO.
- The cleaned-up infra code helped me migrate from OVHCloud to Hetzner to reduce costs by about 40%.
- The Docker work to upgrade and run our Datanommer replica is now shared and documented.
- I’m working on refactoring my infra-as-code (e.g. ArgoCD repos) further so that anyone in FDWG will be able to make a PR against our analytics tooling.
  - The goal is to fully open up all Hatlas infra code and throw open the gates to FDWG as a project bigger than myself.
- Polaris is due for a revisit.
  - When I paused the Polaris POC, v1.3 was still unreleased. That version finally came out while I was still in the depths of the Postgres optimizations, and v1.4 just came out a few days ago. Time to try again!
On the policy side #
- Since I’ve not received any traction from Red Hat Legal, I’ve asked my favorite non-lawyer, Claude Opus, to review our GDPR stance.
  - In short, the broad strokes of what we’re doing are well within GDPR, but we have some policy updates and technical controls which need to be freshened up since Fedora last made its big GDPR push in 2018.
  - These are WIP, both upstream in coordination with Fedora leadership and within Hatlas.
- The biggest user-visible portion of this is that we need to enact an access gate for FDWG analysts.
  - This is what I’m currently focused on, both at the policy and infra level, and I have not updated any of the data on Hatlas since receiving this advice.
This was … long. Sorry about that. I’m impressed that you actually read this! I will try to keep future updates much shorter. This means that they will be less concrete / more tentative in tone. I need to become ok with that :)