2022#11 Readings 🇺🇦🌻
3 minutes read | 600 words by Ruben Berenguel
A middle-of-the-week one because it’s Easter and I may not write this during days off.
I have been investigating Apache Druid and ClickHouse for partially work-related reasons. Thus, it was time to write the post about the Druid design paper: my notes had been sitting on my reMarkable for many months.
📯 Apache Druid: analytical queries powered by magic
I wrote another piece in my Data Papers series.
Decision-Forcing Cases: Gaining experience without the hurt
This is an interesting idea for product people. In the end, isn’t it similar to design interviews for software architecture?
🍿 What nobody tells you about documentation
This is the talk version of a post I shared in the past. It’s just 30 minutes, and totally worth your time if you work in a platform or enablement team.
Writing your First Distributed Python Application with Ray
A very straightforward tutorial for Ray.
Life Advice from NYC Chess Hustlers
Spoiler: there is not a lot of life advice, but there are some interesting interviews.
🍿 Data Microservices in Apache Spark using Apache Arrow Flight 👁️
👀 glance
I understood this as creating data microservices in Spark using Flight, but it seems to be the reverse: creating data microservices which can be queried (blindingly fast) by Apache Spark.
Parallel Grouped Aggregation in DuckDB
The technical posts from DuckDB are always incredibly detailed and about something you rarely have to pay attention to.
Novelist Cormac McCarthy’s tips on how to write a great science paper
These are great, although I don’t fully agree on the one for equations. The writing advice I got very early in my mathematical career is to treat equations as sentences, and roll with that.
Cost Efficiency @ Scale in Big Data File Format
The problems very large companies need to address are clearly non-trivial. For normal companies, this post can boil down to “use ZSTD compression”. But Uber had to go deeper to squeeze as many savings as possible.
Comparison of open source OLAP systems: Druid, Pinot, Clickhouse
I have been interested in Druid for a while 😉, and recently (because of Tinybird) have checked out ClickHouse. I wasn’t very impressed with the architectural decisions (and information!) available for ClickHouse, so my assessment seems to coincide with this article: if you need to support multiple nodes of one of these… go for either of the Apache projects.
🍿 DuckDB – The SQLite for Analytics 👁️
👀 glance
I have used DuckDB locally on occasion, when I needed to have a look at a Parquet file and didn’t need anything particularly pythonic, like just running a group and count, which is easier to write in SQL than in Pandas. And it works well, really well. By the way, another thing I do somewhat regularly for some reason is checking the header metadata of Parquet files. The most convenient tool for this is Arrow’s read_metadata.