The Presto paper

Jun 12, 2022 4 minutes read | 843 words

This is the next installment in my quest to read and help understand interesting papers in the data space.

Ray: Another way to distribute work in a cluster

May 23, 2022 8 minutes read | 1548 words

A new entry in the data papers series. Ray is a distributed framework for next-generation AI applications. What does this mean? A scam? Blockchain on AI? Nah, it’s actually pretty cool: it has actors.

Apache Druid: analytical queries powered by magic

Apr 9, 2022 6 minutes read | 1085 words

It has been a while since my previous data paper. This time I tackle a lesser-known one.

Lakehouse: It's like Delta Lake, but not really

Jan 19, 2021 5 minutes read | 1041 words

Lakehouse is the brand name for the underlying architecture of Databricks' Delta Lake: a data lake that is as performant as a data warehouse.

Down memory lane: the Hive paper

Jan 12, 2021 6 minutes read | 1124 words

Hive is arguably old. It is also undoubtedly useful, even now, 10 years after it was introduced.

The RDD paper: introducing the Spark general purpose framework

Nov 8, 2020 9 minutes read | 1909 words

This is the next instalment in my quest to read and help understand interesting papers in the data space.

Databricks' Delta Lake: high on ACID

Oct 12, 2020 15 minutes read | 3024 words

After reading the Snowflake paper, I got curious about how similar engines work. Also, as I mentioned in that article, I like knowing how the data sausage is made. So, here I will summarise the Delta Lake paper by Databricks.

Does Snowflake have a technical moat worth 60 billion?

Oct 2, 2020 15 minutes read | 3032 words

I didn’t know much about Snowflake, so I decided to have a look at its SIGMOD (ACM Special Interest Group on Management of Data) paper and dig into what special capabilities it offers and how it compares to others.