2022#23 Readings 🇺🇦🌻
8 minute read | 1697 words by Ruben Berenguel

If you are reading this as a subscriber, you’ll see some differences.
Last week I was on holiday, which means I got through a lot of my reading backlog. I have also decided to change the format of these posts and the platform for the newsletter.
Code
If you are into optimising your Python code, one technique that is very rarely used (for good reasons!) is code generation. Advanced Python: Achieving High Performance With Code Generation covers a narrow but powerful application: directly generating bytecode for tight inner loops. The post is detailed, and can be useful if you don’t want to go to native code extensions or Cython.
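To make the idea concrete, here is a minimal sketch of source-level code generation in Python: build specialised source for a hot path at runtime, then compile and exec it. The article goes a step further and emits bytecode directly; make_power_fn and the unrolling below are my own illustration, not code from the post.

```python
def make_power_fn(exponent: int):
    """Generate a function computing x**exponent with the loop unrolled."""
    # Build specialised source at generation time: x * x * ... * x.
    body = " * ".join(["x"] * exponent) or "1"
    source = f"def power(x):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["power"]

power5 = make_power_fn(5)
print(power5(2))  # 32
```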
If you prefer Go to Python, Simon Eskildsen has written a very detailed post on optimising Go, Scaling Causal’s Spreadsheet Engine From Thousands to Billions of Cells: From Maps to Arrays. The TL;DR is moving from maps to arrays, and the target is the computing engine behind Causal, a multi-dimensional spreadsheet SaaS. A takeaway from this post that applies to any language:
When analyzing a system top-down with a profiler, it’s easy to miss the forest for the trees. It helps to take a step back, and analyze the problem from first principles.
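The core trick travels to any language. Here is a hypothetical Python rendering of the map-to-array move (the cell layout and dimensions are mine, not Causal’s):

```python
ROWS, COLS = 1_000, 100

# Map version: flexible, but every access pays for hashing the key.
cells_map = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}

# Array version: when dimensions are dense and known up front,
# index arithmetic replaces hashing entirely.
cells_arr = [0.0] * (ROWS * COLS)

def get_cell(row: int, col: int) -> float:
    return cells_arr[row * COLS + col]
```

In Go the win is bigger still, since a slice is contiguous memory with no per-key hashing or pointer chasing.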
Unison is a functional language created by Rúnar Bjarnason and Paul Chiusano (the authors of Functional Programming in Scala; that link points to the new edition for Scala 3). It is designed for distributed computing, and the surprising concept is that function names are just labels over hashes of their ASTs. You can find out more in Trying out unison Part 1: code as hashes.
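A toy Python sketch of the content-addressed idea (Unison additionally normalises away variable names and resolves dependencies by hash, which this toy skips):

```python
import ast
import hashlib
import inspect

def code_hash(fn) -> str:
    # Hash the function's AST rather than its text, so whitespace
    # and comments don't change the identifier.
    tree = ast.parse(inspect.getsource(fn))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:16]

def double(x):
    return x + x

print(code_hash(double))
```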
A quick one: Hynek Schlawack details how he is using PyOxidizer to build a portable Python executable for his project Doc2Dash in You Can Build Portable Binaries of Python Applications.
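For flavour, a minimal pyoxidizer.bzl along these lines; the exact settings in Hynek’s post differ, and the package name here is just the obvious guess:

```python
# pyoxidizer.bzl: Starlark configuration consumed by "pyoxidizer build".
def make_exe():
    dist = default_python_distribution()
    exe = dist.to_python_executable(name="doc2dash")
    # Vendor the application and its dependencies into the binary.
    exe.add_python_resources(exe.pip_install(["doc2dash"]))
    return exe

register_target("exe", make_exe, default=True)
resolve_targets()
```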
Data Engineering
DuckDB keeps adding functionality at an impressive pace. In case you don’t know it, DuckDB is an in-process OLAP database, i.e. what SQLite is for OLTP. The most recent addition is being able to query Postgres directly. The DuckDB team has written up all the details in DuckDB - Querying Postgres Tables Directly From DuckDB. It is a DuckDB plug-in offering a table scan for Postgres tables, with predicate pushdown and parallelised execution. Impressive overall.
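From Python, using it looks roughly like this (extension name per the announcement; the connection string and table are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres_scanner; LOAD postgres_scanner;")
# Scan a live Postgres table without copying it into DuckDB first.
count = con.execute(
    "SELECT count(*) "
    "FROM postgres_scan('host=localhost dbname=shop', 'public', 'orders')"
).fetchone()
print(count)
```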
If you have been using Kafka for long you will have seen the removal of ZooKeeper coming. ZooKeeper was initially used by Kafka to keep track of partitions and replica status, and to elect a cluster leader. In old versions (I think this changed in 1.0, maybe 2.0) clients would connect to ZooKeeper to get metadata; this was changed so clients ask a Kafka broker directly, reducing requests to ZooKeeper: you can keep the Kafka instances large and the ZooKeeper ones as small as possible. The final move has been migrating all this metadata to Kafka itself and implementing a Raft-like consensus algorithm using Kafka primitives. Confluent has written a detailed post about this: Why ZooKeeper Was Replaced With KRaft - The Log of All Logs.
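For reference, running a broker without ZooKeeper boils down to a few KRaft settings in server.properties (node ids and addresses below are placeholders):

```properties
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
```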
Jamie Brandon (the creator of the HYTRADBOI database conference earlier this year, and of the database jam this past week) wrote an extensive post about query planning for streaming systems last year. It covers real examples (Materialize, ksqlDB, Flink) and the complications that appear in practice. Surprisingly, relying on human judgement (hinting or rewriting the query) can be good enough compared with complex cost-based query planning.
You may have heard about data mesh. What is it? Is it something you should worry about? The TL;DR I can offer is that unless you are in a very large company you shouldn’t worry about it or try to apply it. For the full picture, read LakeFS’s post Data Mesh: What Is It and What Does It Mean for Data Engineers?
Data Products
The idea of data products has been brewing for a long while. I suspect it originally appeared in the Data Mesh “manifesto” by Zhamak Dehghani (now available as a full book; see also the last post of this section). Mapping the end-to-end reality of how complex data work is can be daunting: the user wants a dashboard, but the reality involves data engineering work, data modelling work and deep analysis work. You can hear/read more about it in this talk by Stephen Bailey from last year’s Coalesce conference, Smaller Black Boxes: Towards Modular Data Products. A similar problem is covered by Eric Weber in Making Data Actionable: The Immense Challenge of Good Data Products. And in case you want to read even more, Run Your Data Team Like a Product Team on the Locally Optimistic blog has some ideas about customers and data products.
Airbyte has written a relatively interesting piece about the relationship between orchestration (i.e. Airflow) and data products, Data Orchestration Trends: The Shift From Data Pipelines to Data Products. If you ignore the “data product” section, it is a very good comparison of several orchestrators (it omits Flyte, though).
LinkedIn offers their take on data modelling and data products, which they call Super Tables. The interesting points are how they make them a product by establishing SLAs, change notices and stability guarantees. Have a read, it’s interesting: Super Tables: The Road to Building Reliable and Discoverable Data Products.
Connected with data products we have data contracts. There is still no canonical formulation of them, but the idea is to establish explicit agreements between the service/product engineers who produce data and the consumers of that data. Chad Sanderson has been talking about them extensively, and seems to be building a product around them. A good starting point is The Rise of Data Contracts. If you are a data engineer, this quote from the article will hit home:
If you talk to almost any SWE that is producing operational data which is also being used for business-critical analytics they will probably have no idea who the customers are for that data, how it’s being used, and why it’s important.
Miscellaneous Data
Memory traditions (I would recommend Memory Craft by Lynne Kelly, 6 stars out of 5 for me) have used representations that are, in a way, data visualizations. For example, the Yakama Time Ball represents life events with knots and beads on a string.
I’m sure there is some data hero in your company. Someone who is always the first to solve a failure, who knows what lies deep in all those SQL files and can explain why it was done that weird way. Mikkel Dengsoe explains why you should treat them well in The Unsung Data Heroes.
Anna Filippova from the dbt labs newsletter offers a thought we all should consider:
Being data driven is the responsibility of your executive team. Full stop. Getting an organization to be data driven without executive buy-in is a little like campaigning for better election oversight in a totalitarian state — kind of futile.
What is an insight? When your stakeholders request more insights from your data team… what are they expecting? Have a read of this Forbes piece, Insight Literacy: Why We Need to Clarify What Insights Really Are. In it you will find a great quote by Gary Klein (expertise researcher, and author of several books I have recommended here) defining an insight as “an unexpected shift in the way we understand things.”
Documentation is the bane of most data orgs I know. Almost all data engineers, scientists and architects I know love documenting what they do, but are never offered the time or the incentives. You will find some ideas on how to address it (and why) in this post: Gamification of Data Knowledge.
Miscellaneous Software Engineering Stuff
Prompted by a tweet from Gergely Orosz, Simon Willison has written a whole post about good software engineering practices. There’s nothing in it to disagree with, and as an action item you should think about which of them your work is weak in and work to improve those.
Being more visible (writing, speaking at conferences, talking with people in other departments…) has many advantages; in particular, it increases your luck. Why? Because it increases the number of opportunities for something positive to happen. This is covered in the GitHub ReadME article Publishing Your Work Increases Your Luck, and also in How to Increase Your Luck Surface Area.
Slightly related to the subject, you can read (paywalled, but worth it) Gergely Orosz’s Internal Politics for Software Engineers and Managers: Part 2. This is the kind of “stuff at work” most engineers are bad at, and we should all work on it. And related to this last one, Will Larson wrote a post earlier this year about deciding when to leave your job and what to maximise for (learning, money, time), particularly in these turbulent times.
Hillel Wayne has a post in GitHub’s ReadME with the odd title of The Five-Minute Feedback Fix. From the title you may think it is about management, but it is actually about lightweight formal methods. In particular, it introduces decision tables and recommends Alloy. I have written a couple of posts about Alloy: Modelling data pipelines with Alloy and Data pipelines with Alloy, Take 2.
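If decision tables are new to you, their appeal fits in a few lines: list every combination of conditions, and the table becomes mechanically checkable for completeness. A toy Python sketch (the alert-routing policy is invented for illustration):

```python
from itertools import product

# Each rule maps (is_critical, in_business_hours) to an action.
rules = {
    (True, True): "page on-call",
    (True, False): "page on-call",
    (False, True): "open ticket",
    (False, False): "queue for morning",
}

# Completeness check: every combination of inputs must be covered.
for combo in product([True, False], repeat=2):
    assert combo in rules, f"uncovered case: {combo}"
print("decision table is complete")
```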
Finally, a not-so-much-engineering post by Steve Blank on how to understand any industry (or actually, any concept): Mapping the Unknown – The Ten Steps to Map Any Industry. It reminds me of the steps needed for a Wardley map, or even a concept map.