8 minute read | 1697 words | by Ruben Berenguel
Some links are affiliate links.
If you are reading this as a subscriber, you’ll see some differences.
Last week I was on holiday, which means I got through a lot of my reading backlog. I have also decided to change the format of these posts and the platform for the newsletter.
If you are into optimising your Python code, one technique that is very rarely used (for good reasons!) is code generation, covered in Advanced Python: Achieving High Performance With Code Generation. The application is narrow but powerful: directly generating bytecode for tight inner loops. The post is detailed, and can be useful if you don’t want to go all the way to native code extensions or Cython.
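As a flavour of the idea, here is a minimal sketch of code generation at the source level (the linked article goes further and emits bytecode directly; `make_polynomial` and the example coefficients are my own illustration, not from the post):

```python
# Instead of interpreting a polynomial's coefficients on every call,
# we generate specialised Python source once and compile it, so the
# inner loop runs straight-line code with the coefficients baked in.

def make_polynomial(coeffs):
    """Return a compiled function evaluating the polynomial with the
    given coefficients (highest degree first)."""
    # Horner's method, unrolled into a single expression at build time.
    expr = "0"
    for c in coeffs:
        expr = f"({expr}) * x + {c!r}"
    source = f"def poly(x):\n    return {expr}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["poly"]

poly = make_polynomial([2, 0, 1])  # 2*x**2 + 1
print(poly(3))  # 2*9 + 1 = 19
```

The generated function pays the string-building and compilation cost once, which is the same trade-off the bytecode approach makes, just one level higher in the stack.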
When analyzing a system top-down with a profiler, it’s easy to miss the forest for the trees. It helps to take a step back, and analyze the problem from first principles.
Unison is a functional language created by Rúnar Bjarnason and Paul Chiusano (the authors of Functional Programming in Scala, this link is for the new version for Scala 3). It is designed for distributed computing, and the surprising concept is that function names are just labels over hashes of their ASTs. You can find out more in Trying out unison Part 1: code as hashes.
DuckDB keeps adding functionality at an impressive pace. In case you don’t know, DuckDB is an in-process OLAP database, i.e. what SQLite is to OLTP. The most recent addition is the ability to query Postgres directly. The DuckDB team has written all the details in DuckDB - Querying Postgres Tables Directly From DuckDB. It is a DuckDB plug-in offering a table scan for Postgres tables. It has predicate pushdown and is parallelized. Impressive overall.
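In use it looks roughly like this (a sketch assuming a local Postgres; the connection string and the `public.orders` table are illustrative, not from the post):

```sql
-- Install and load the Postgres scanner plug-in inside DuckDB.
INSTALL postgres_scanner;
LOAD postgres_scanner;

-- Scan a Postgres table directly from DuckDB; the WHERE filter is
-- pushed down to Postgres rather than evaluated after transfer.
SELECT customer_id, sum(amount)
FROM postgres_scan('host=localhost dbname=shop', 'public', 'orders')
WHERE order_date >= DATE '2022-01-01'
GROUP BY customer_id;
```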
If you have been using Kafka for long, you will have seen the removal of ZooKeeper coming. ZooKeeper was initially used by Kafka to keep track of partitions and replica status, and to elect a cluster leader. In old versions (I think this changed in 1.0, maybe 2.0) clients would connect to ZooKeeper to get metadata details. This was changed so that clients talk directly to a Kafka broker instead, reducing the load on ZooKeeper: you can keep the Kafka instances large and the ZooKeeper ones as small as possible. The last move has been migrating all this metadata into Kafka itself and implementing a Raft-like consensus algorithm using Kafka primitives. Confluent has written a detailed post about this: Why ZooKeeper Was Replaced With KRaft - The Log of All Logs.
Jamie Brandon (the creator of the HYTRADBOI database conference earlier this year and the database jam this past week) wrote an extensive post about query planning for streaming systems last year. It covers real examples (Materialize, ksqlDB, Flink) and the complications that appear in real-time settings. Surprisingly, human judgement (hinting or rewriting the query) can be good enough compared with complex cost-based query planning.
Connected with data products, we have data contracts. There is still no pure formulation of them, but the idea is to establish communication between the service/product engineers who create data and the consumers who depend on it. Chad Sanderson has been talking about them extensively, and seems to be building a product around them. A good introduction is The Rise of Data Contracts. If you are a data engineer, this quote from the article will hit home:
If you talk to almost any SWE that is producing operational data which is also being used for business-critical analytics they will probably have no idea who the customers are for that data, how it’s being used, and why it’s important.
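A minimal sketch of what a data contract can enforce in practice (the field names and types are illustrative, not from the article): the producing service validates events against an agreed schema before emitting them, so downstream analytics consumers are not broken silently.

```python
# The "contract": field names and types agreed between the producing
# service and its analytics consumers.
CONTRACT = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}

def validate(event, contract=CONTRACT):
    """Return a list of contract violations for one event."""
    errors = []
    for field, expected in contract.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(event[field]).__name__}")
    return errors

print(validate({"order_id": "A-1", "amount_cents": 995, "currency": "EUR"}))  # []
print(validate({"order_id": "A-2", "amount_cents": "995"}))  # two violations
```

Real data contract tooling adds versioning, schema registries and CI checks on top, but the core is exactly this: an explicit, checkable agreement instead of implicit knowledge.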
Memory traditions (I would recommend Memory Craft by Lynne Kelly, 6 stars out of 5 for me) have used representations that are, in a way, data visualizations. For example, the Yakama Time Ball represents life events with knots and beads on a string.
I’m sure there is some data hero in your company. Someone who is always the first to fix a failure, who knows what lies deep in all those SQL files, and who can explain why it was done that weird way. Mikkel Dengsoe explains why you should treat them well in The Unsung Data Heroes.
Being data driven is the responsibility of your executive team. Full stop. Getting an organization to be data driven without executive buy-in is a little like campaigning for better election oversight in a totalitarian state: kind of futile.
What is an insight? When your stakeholders request more insights from your data team… what are they expecting? Have a read of this Forbes piece, Insight Literacy: Why We Need to Clarify What Insights Really Are. You can find a great quote there by Gary Klein (expertise researcher, author of several books I have recommended here), defining an insight as “an unexpected shift in the way we understand things.”
Documentation is the bane of most data orgs I know. Almost all data engineers, scientists and architects I know would love to document what they do, but are never offered the time or the incentives to do so. You will find some ideas on how to address it (and why) in this post: Gamification of Data Knowledge.
Being more visible (writing, speaking at conferences, talking with people in other departments…) has many advantages, in particular that it increases your luck. Why? Because it increases the number of opportunities for something positive to happen. This is covered in this article from GitHub’s ReadME, Publishing Your Work Increases Your Luck and also in How to Increase Your Luck Surface Area.