A middle-of-the-week one because it’s Easter and I may not write this during days off.
I have been investigating Apache Druid and ClickHouse for partly work-related reasons, so it was time to write the post about the Druid design paper: my notes had been sitting on my reMarkable for many months.
The problems very large companies need to address are clearly non-trivial. For normal companies, this post can boil down to “use ZSTD compression”. But Uber had to go deeper to squeeze as many savings as possible.
I have been interested in Druid for a while 😉, and recently (because of Tinybird) have looked into ClickHouse. I wasn’t very impressed with ClickHouse’s architectural decisions (or with the information available about them!), so my assessment seems to coincide with this article: if you need to support multiple nodes of one of these… go for either of the Apache projects.
I have used DuckDB locally on occasion, when I needed to have a look at a Parquet file and didn’t need anything particularly pythonic, like just running a group-by and count, which is easier to write in SQL than in Pandas. And it works well, really well. By the way, another thing I do somewhat regularly for some reason is checking the header metadata of Parquet files. The most convenient tool for this is Arrow’s read_metadata.