2022#24 Readings 🇺🇦🌻
10 minutes read | 2078 words by Ruben Berenguel
Trying a cross between the old format and the new format (since there are people who like both).
Data engineering
Next generation design for Materialize
Materialize, a “streaming database”, has just announced its next generation design and engine. It now offers strict serializability as its default consistency model, by implementing virtual time. From a distance, the idea of virtual time seems similar to Lamport clocks, with the twist that processing is allowed to “roll back in time” to handle laggards. You can read the paper here; I might write down some thoughts on it for the data papers series at some point. If you need to process streaming data, Materialize and Tinybird could be two options to consider if you don’t want to roll your own “Kafka hell”. If you want to suffer (and have fun) you can follow the ideas in this Hacker News comment on how to implement a “poor man’s Materialize”.
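For readers who have not met Lamport clocks, the core rules fit in a few lines. This is a minimal sketch of a classic Lamport clock only: it does not model Materialize's virtual time or its ability to revisit earlier times, and the class and method names are illustrative.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event, or just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, sender_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()           # a: local event, a.time == 1
stamp = a.tick()   # a sends a message stamped 2
b.tick()           # b: unrelated local event, b.time == 1
b.receive(stamp)   # b jumps past the sender: max(1, 2) + 1 == 3
print(a.time, b.time)
```

The receive rule is what guarantees that a message is always timestamped later at the receiver than at the sender, whatever the wall clocks say.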
Stretching My Legs in the Data Engineering Ecosystem in 2022
Robin Moffatt, aka that Kafka guy, has lately been delving into Data Engineering topics on his blog, with the first piece being Stretching My Legs in the Data Engineering Ecosystem in 2022. I think the posts are very interesting for data engineers because he comes at the field from a different angle than someone who has been in the “data trenches” for a while. Of particular importance is how the “data engineer” of old has split into “data engineer” and “analytics engineer” (and possibly “machine learning engineer”).
Generic data stuff
How the Hottest VC-Backed Cloud Data Companies Are Prepping for the Funding Winter
This article is 2 months old, but still very relevant. It covers what the “cool data companies” are doing now that the investment market is cooling. It includes details about Databricks, dbt labs, Monte Carlo, Airbyte and Fivetran. The most surprising details for me are how low annualized revenues are vs valuations, particularly for “those that are not Databricks” in the list.
7 Key Learnings From GoCardless’ Experience Implementing Data Contracts
Above I mentioned Monte Carlo, and this comes from their blog. It is, again, about data contracts. This article is pretty detailed and includes a series of key learnings that will be useful if you want to implement your own contracts. It also covers how the process was structured to onboard engineering and convince them to adopt the contracts, which is probably one of the hardest hurdles. If you are curious about the contract protocol they use, it is based on Jsonnet.
Do’s and Don’ts of Data Mesh
Another classic theme in the modern data world. An important recommendation in the don’ts: don’t go by the book. I think this is something a lot of people forget about any of these “frameworks” that come with a book (data mesh, clean code, domain-driven design…): every organization, department and system has its own quirks, and very often you need to adapt to them instead of trying to fit the square peg in the round hole.
Topology of a Data Product Team
This article can be a good introduction to Team Topologies, particularly when applied to data teams. It may surprise some of you, but there are a lot of people in data leadership positions in “modern” companies that have no idea what team topologies are. I don’t think the book is ground-breaking myself, but the characterization it provides is helpful to understand what a team should and should not do, and where there may be a missing link in the chain.
Querying Newline-Delimited JSON Logs Using AWS Athena (using GPT-3 to create Athena table definitions!)
This is a section of a TIL (today I learned) from Simon Willison where he shows how you can use GPT-3 to create an Athena schema out of a JSON document. If you have ever used Athena, or Glue, or Hive… you’ll love this.
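To give a sense of what such a table definition involves, here is a minimal sketch that skips GPT-3 altogether and infers column types from a single sample log line. The function names, the table name, the S3 location and the type mapping are illustrative assumptions, not taken from the TIL; the generated statement relies on the OpenX JSON SerDe, which Athena supports for JSON data.

```python
import json


def athena_type(value):
    """Map a decoded JSON value to an Athena (Hive) column type."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous; empty arrays default to strings.
        inner = athena_type(value[0]) if value else "string"
        return f"array<{inner}>"
    if isinstance(value, dict) and value:
        fields = ",".join(f"{key}:{athena_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    # Strings, nulls and empty objects all fall back to string.
    return "string"


def athena_ddl(sample_line, table, location):
    """Build a CREATE EXTERNAL TABLE statement from one JSON log line."""
    record = json.loads(sample_line)
    columns = ",\n".join(
        f"  `{name}` {athena_type(value)}" for name, value in record.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{columns}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


sample = (
    '{"ts": "2022-10-01T12:00:00Z", "status": 200, "ok": true, '
    '"tags": ["a"], "req": {"path": "/", "ms": 1.5}}'
)
ddl = athena_ddl(sample, "logs", "s3://example-bucket/logs/")
print(ddl)
```

A single sample line is a weak basis for a schema (optional fields and mixed types will be missed), which is part of why letting a language model draft the definition and then reviewing it by hand is appealing.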
Bad Data Analysis Questions I See Every Week
For the data analysts among us. I think I have been asked all of these questions, repeatedly. And I have seen teams waste endless hours trying to come up with answers to them. Don’t be like them. A quote from it, where I could not agree more:
They’d say that analysts are just there to answer stakeholders’ questions and get data quickly. I don’t think that’s the best practice, and neither is the other school of thought in data, where the analyst is viewed as a strategic partner who’s equipped to come up with those hypotheses on their own. As with many dichotomies, I think the truth is in the middle.
Which Fonts to Use for Your Charts and Tables
This comes from Lisa Charlotte Muth writing in Datawrapper’s blog. If you enjoy anything about data visualization, you’ll love their blog and this post in particular. If you are not that interested (it’s a very long post), the TL;DR:
When in doubt, set your text in a font that’s easy to read. Easy to read is everything that readers are used to. On the web, that means sans-serif, neither overly narrow nor wide, regular (instead of bold or thin) text set in sentence case, in a size that’s big enough to read, and in black or almost black.
Gallery of Physical Visualizations and Related Artifacts
Last week I shared one link from this collection, but the whole gallery is worth a look for anybody interested in data visualization.
Forecasting Something That Never Happened: How We Estimated Past Promotions Profitability
Measuring the uplift of something you actually did is an important problem, and a very common one in advertising too. As you can imagine, the approach is based on training a model on non-promotion days and then predicting what the promotion day would have looked like without a promotion. Of note, they started by using Facebook’s Prophet library but eventually moved to XGBoost.
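The counterfactual idea can be sketched in a few lines. This toy version swaps XGBoost for a plain per-weekday average, and the data, the function names and the numbers are all made up for illustration; it is not the article's model.

```python
from collections import defaultdict
from statistics import mean


def fit_baseline(days):
    """Learn average sales per weekday using only non-promotion days."""
    by_weekday = defaultdict(list)
    for day in days:
        if not day["promo"]:
            by_weekday[day["weekday"]].append(day["sales"])
    return {weekday: mean(values) for weekday, values in by_weekday.items()}


def estimate_uplift(days, baseline):
    """Sum of (actual sales - predicted no-promotion sales) over promotion days."""
    return sum(
        day["sales"] - baseline[day["weekday"]] for day in days if day["promo"]
    )


# Hypothetical sales history: four ordinary days and two promotion days.
history = [
    {"weekday": "Fri", "sales": 100.0, "promo": False},
    {"weekday": "Fri", "sales": 110.0, "promo": False},
    {"weekday": "Sat", "sales": 150.0, "promo": False},
    {"weekday": "Sat", "sales": 170.0, "promo": False},
    {"weekday": "Fri", "sales": 140.0, "promo": True},
    {"weekday": "Sat", "sales": 200.0, "promo": True},
]

baseline = fit_baseline(history)
uplift = estimate_uplift(history, baseline)
print(baseline, uplift)
```

The important design point is that promotion days never enter the training data, so the prediction for those days stands in for "what would have happened anyway".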
Software Engineering
Checking Statistical Properties of Protocols Using TLA+
I am a lightweight fan of formal verification (I prefer Alloy, but my only talk about formal verification was with TLA+). This post introduces an addition that is as fresh as possible: it was announced at this year’s TLA+ conference and is only available in the nightly build. It introduces something I have always wanted: stats. With TLA+ models, until now, you could only say things like eventually the system will do X. Now you can instead let the model run for a while and get something like 93% of the time the system won’t ever do X. I’ll try to think of an interesting data-related example I can write about here.
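The flavour of answer this gives can be imitated with plain random simulation. The sketch below is not TLA+ and does not use TLC; it is a hypothetical toy protocol (resend a message up to three times over a lossy link) run many times in Python to estimate how often a property holds. The loss probability, attempt limit and function names are assumptions chosen for illustration.

```python
import random


def delivered(rng, loss_probability, max_attempts):
    """One run: resend until an attempt gets through or attempts run out."""
    return any(rng.random() >= loss_probability for _ in range(max_attempts))


def estimate(runs, loss_probability=0.3, max_attempts=3, seed=0):
    """Fraction of simulated runs in which the message is delivered."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    successes = sum(
        delivered(rng, loss_probability, max_attempts) for _ in range(runs)
    )
    return successes / runs


fraction = estimate(20_000)
print(f"{fraction:.1%} of runs delivered the message")
```

For this toy case the exact answer is 1 - 0.3^3 = 0.973, so the simulated fraction should land close to that; the appeal of having statistics in the model checker is getting such numbers for protocols where no closed form is available.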
How to Query (Almost) Everything
This is the blog post that accompanies this presentation at this year’s HYTRADBOI conference. The presentation is just 10-15 minutes long, and is super interesting. Just go watch it, and once you have done so, the repo you want to check is this one.
Functional Programming in Go
But don’t expect monads. This is an introductory post to Go’s generics, with implementations of map, filter and others. The author has written several (quite good) Go books, so this is good introductory material if you have wanted to try generics but didn’t know where to start.
Scripting with Scala
With his previous post about Scala (Scala isn’t fun anymore) it looked as if Alex was tired of Scala… but then he wrote this interesting post about using Scala for scripting. The questions you are probably thinking of:
- How? Using Scala CLI;
- Aren’t they terribly slow to start due to the JVM? No, you can either get it to use Scala Native (see here) or GraalVM (here).
Zig, the small language
You may not have heard about Zig (Rust is noisier, cooler, etc), but it is also a “replacement for C” language. As shown in this article, it is an extremely small language (unlike Rust). There’s a code block in the article that shows all syntax from Zig in 50 lines.
Fully-Typed Python Decorator With Optional Arguments
This is probably very niche, but last time I wrote a Python decorator I couldn’t really think of how to properly type it. This post explains all you need to know to do it properly. And all I need to know to do it properly next time, too!
Writing
Writing for Engineers
If you know me in person or in work situations, you’ll know I’m big on writing and on the importance of good documentation. This article has a good set of recommendations to make your writing flow smoothly.
Motivating Developers to Care About Documentation
On theme with the post above, but this one covers how to shape the engineering culture to create space to write documentation. Many engineering leaders say they like documentation, but put nothing in place to make sure there is time and space for documentation to happen.
Putting Amazon’s PR/FAQ to Practice
The PR/FAQ system was explained in Working Backwards. I read it some time ago, and although I am not a fan of the book, Cedric Chin (author of this post) is. Since he’s smart, I am assuming he sees something I don’t and thus want to keep this post for the future. He has been trying to apply the method to his own projects, and here there are some details on do’s and don’ts as well as some lists of things to remember to cover in the PR and the FAQ parts.
Miscellaneous
How to Realize Various Actions in a One-Button Game
I’m not sure how I landed on this article, but it turned out to be written by the developer of a game I really enjoyed years ago, Torus Trooper (a Tempest clone). There are a lot of playable examples in the article, with several different one-button mechanics. Coincidentally, these days I have beaten a one-finger (but not one-button) game on iOS (an Apple Arcade exclusive), Bleak Sword. You can read Polygon’s review from a few years ago in Bleak Sword Is a Deceptively Brutal Mobile Game.
Why the Number 137 Is One of the Greatest Mysteries in Physics
This is about the fine-structure constant, a dimensionless constant (approximately 1/137) that appears in several physical computations. A quote from Nobel laureate Wolfgang Pauli:
When I die my first question to the Devil will be: What is the meaning of the fine structure constant?
Note that 137 is almost 42 in base 34.
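The arithmetic behind that remark is quick to check: reading the digits "42" in base 34 gives 4 × 34 + 2.

```python
# Interpret the digits "42" as a base-34 number: 4 * 34 + 2.
value = int("42", 34)

# How far that lands from 137.
difference = value - 137

print(value, difference)
```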
How to Write Good Prompts: Using Spaced Repetition to Create Understanding
I have been using Anki extensively (again) and, as usual, knowing what to add and how to add it are the two hardest problems in spaced repetition. Andy Matuschak (and Michael Nielsen) have given a lot of thought to how best to use spaced repetition, and in this article you can find some good ideas. In case you don’t see the point of memorising things when you can look them up, I found a very good quote here, where he addresses cooking.
I’d cooked fairly seriously for about a decade before I began to use spaced repetition, and of course I naturally internalized many core techniques and ratios. Yet whenever I was making anything complex, I’d constantly pause to consult references, which made it difficult to move with creativity and ease. I rarely felt “flow” while cooking. My experiences felt surprisingly similar to my first few years learning to program, in which I encountered exactly the same problems.
To reach a good state of flow you need to have a large amount of context and knowledge at your fingertips; otherwise you need to stop to look things up.