The lakehouse architecture is the newest iteration in data warehousing. The goals of this architecture are:

- Be based on open data formats (like Parquet or ORC), and allow direct access to the underlying data.
- Offer first-class support for machine learning and data science workloads.
- Have state-of-the-art performance.
As you can see, these range from "I could do that" to "state of the art". The key lens for understanding these goals is existing industry trends.
First generation data warehouses were used to dump data from operational databases to support decision-making and BI (Business Intelligence). Data would be written with schema on write, to speed up BI queries. These were usually some form of (distributed, or just large) RDBMS (relational database management system).
Schema on write
Schema on write means that the data is saved according to a specified schema. The basic example would be inserting data into a table: the data needs to conform with the table schema. This is in contrast with schema on read, where you just save data and specify what you want to load from it when reading.
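The contrast can be made concrete with a toy sketch (hypothetical code, not any real database API): schema on write validates rows as they are stored, while schema on read stores anything and applies a schema only when querying.

```python
# Toy illustration of schema on write vs schema on read.

SCHEMA = {"id": int, "name": str}

def write_with_schema(table: list, row: dict) -> None:
    """Schema on write: reject rows that do not conform to the table schema."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {set(row)} do not match schema {set(SCHEMA)}")
    for col, typ in SCHEMA.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"column {col!r} expects {typ.__name__}")
    table.append(row)

def read_with_schema(raw: list, wanted: dict) -> list:
    """Schema on read: store anything, project and cast only when reading."""
    out = []
    for row in raw:
        try:
            out.append({col: typ(row[col]) for col, typ in wanted.items()})
        except (KeyError, ValueError, TypeError):
            pass  # rows that cannot be interpreted are skipped at read time
    return out

table: list = []
write_with_schema(table, {"id": 1, "name": "ada"})      # accepted
# write_with_schema(table, {"id": "x", "name": "y"})    # would raise TypeError

lake = [{"id": "2", "name": "bob"}, {"junk": True}]      # anything goes on write
print(read_with_schema(lake, {"id": int, "name": str}))  # [{'id': 2, 'name': 'bob'}]
```

The cost of validation moves with the schema: the warehouse pays it once at load time, the lake pays it on every read.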
Next, focus moved to data lakes. A data lake is a high-volume storage layer, using low-cost storage and a file API, with files stored in an open format. The quintessential and initial example would be HDFS (Hadoop Distributed Filesystem). Data would be dumped in the lake, then copied into the warehouse via ETL (extract-transform-load) or ELT (same steps, with the transform happening after loading).
After some time, cloud-based systems (Amazon's S3, Azure's Blob Storage, Google's Cloud Storage) replaced HDFS as the storage layer in the data lake. This change increased durability, reduced cost and offered archival tiers, like S3 Glacier. The data warehouse layer likewise changed to cloud offerings like AWS Redshift or Snowflake. The data lake + data warehouse architecture became dominant.
The problem with this layout, and a persistent headache most data engineers share, is that you have two systems that need to be kept in sync: you have to keep your ETL pipelines reliable, and prevent the warehouse from becoming stale. Support for machine learning was limited: warehouses were built for BI, lakes for bulk access. And the final straw is the total cost of ownership: you pay for storage twice, once for the data lake and again for the warehouse.
An obvious solution: what if we could use the lake as if it were a data warehouse?
Some trends were already converging on similar solutions. Data warehouses were adding support for external tables, to be able to access data stored in a data lake in Parquet or ORC. There was also a surge of investment in SQL engines that run directly against data lakes, like SparkSQL, Presto (since forked as Trino), Apache Hive and AWS Athena. One issue with them: these engines lack full ACID guarantees (Atomicity, Consistency, Isolation, Durability).
A lakehouse combines these ideas: it uses low-cost, directly accessible cloud storage while offering DBMS features like ACID transactions, versioning, indexing, caching and query optimisation. It sounds good, of course.
Databricks’ Delta Lake (which is open source, although Databricks has some internal extensions) is one possible lakehouse implementation. It uses S3 (or some other cloud storage) as a cheap storage layer, Parquet as the open storage format, and a transactional metadata layer on top. If you are curious, I summarised the Delta Lake paper here. Some similar projects, Apache Iceberg and Apache Hudi, can also be considered lakehouse implementations; the ideas behind them are similar to Delta Lake's.
All these systems are based on adding a metadata layer on top of cloud storage. This layer keeps track of transactions (a good step towards building ACID features), and enables time travel, data quality enforcement and auditing. Also, a metadata layer should have no significant performance impact. Note that in general this metadata layer allows zero-copy ingestion: if the format of the files you already have in cloud storage matches that of the lakehouse, adding data is just appending a metadata entry saying “the table now includes these files”, hence zero copy.
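A minimal sketch of the idea, assuming a made-up log format (the real Delta Lake protocol is more involved): the table is an append-only log of JSON actions, committing existing files is metadata-only (no data is rewritten), and replaying a prefix of the log gives time travel.

```python
# Hypothetical metadata layer over object storage: a table is just an
# append-only transaction log; each commit is one JSON action.
import json

class TransactionLog:
    def __init__(self):
        self.entries = []  # each entry = one committed JSON action

    def commit_add(self, files):
        """Register existing files as part of the table (zero-copy).

        Returns the version number created by this commit."""
        self.entries.append(json.dumps({"add": files}))
        return len(self.entries) - 1

    def snapshot(self, version=None):
        """Set of files visible at a version; replaying a log prefix
        is what makes time travel possible."""
        upto = len(self.entries) if version is None else version + 1
        files = set()
        for entry in self.entries[:upto]:
            files.update(json.loads(entry)["add"])
        return files

log = TransactionLog()
v0 = log.commit_add(["s3://bucket/part-000.parquet"])
v1 = log.commit_add(["s3://bucket/part-001.parquet"])
print(log.snapshot())            # both files at the latest version
print(log.snapshot(version=v0))  # time travel: only the first file
```

Note how `commit_add` never touches the Parquet files themselves: ingestion cost is independent of data size, which is exactly the zero-copy property.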
Metadata on its own is not enough for good performance (although it helps with some things, like the “many small files” issue). To get high performance, you need to treat the data lake as a database engine would: keep hot data on SSDs (solid-state drives), and maintain indexes and auxiliary data structures like per-file column minimums and maximums (per file, so that files can be pruned early). At the state-of-the-art level, apply data layout and record reordering optimisations (like Z-ordering).
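Per-file min/max pruning is simple enough to sketch. This is a hedged illustration (the file names and the stats layout are invented, not a real format): keep each file's per-column minimum and maximum in metadata, and skip files whose range cannot possibly contain the predicate, before reading any data.

```python
# File-level statistics: per-file (min, max) for a "ts" column.
FILE_STATS = {
    "part-000.parquet": {"ts": (0, 99)},
    "part-001.parquet": {"ts": (100, 199)},
    "part-002.parquet": {"ts": (200, 299)},
}

def prune(stats, column, lo, hi):
    """Return only the files whose [min, max] range overlaps [lo, hi];
    every other file is skipped without being opened."""
    keep = []
    for name, cols in stats.items():
        cmin, cmax = cols[column]
        if cmax >= lo and cmin <= hi:  # ranges overlap -> file may match
            keep.append(name)
    return keep

# A query for ts BETWEEN 150 AND 250 only needs two of the three files:
print(prune(FILE_STATS, "ts", 150, 250))  # ['part-001.parquet', 'part-002.parquet']
```

This is also why layout optimisations like Z-ordering matter: pruning only works well when each file's min/max range is narrow, which is exactly what clustering related records together achieves.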
New execution engine
Databricks has a new C++ execution engine, with similar or better performance compared with the Spark DataFrame/RDD execution engine and competitors, even on smaller machines.
These optimisations live in the execution engine rather than the storage layer, so a lakehouse can use any storage backend as long as its execution engine implements them.
Delta Lake (or Iceberg, or Hudi) is not the only possible lakehouse design. Other directions include:

- Different options for metadata layers.
- Adding support for cross-table transactions (which Delta Lake does not offer at the moment).
- Different caching backends.
- Creating new storage formats optimised for lakehouse or machine learning workloads.
- Leveraging serverless somehow.
- Query pushdown for machine learning processes.
Some are more far-fetched than others. Creating new storage formats is not easy, especially formats optimised for such a wide range of workloads. Leveraging serverless is slightly more interesting, although it is unclear how. One idea could be caching data at the edge, for extremely high speed on common queries. Query pushdown seems a bit more far-fetched, but if the concept of a feature store goes mainstream, feature computation could be pushed down to the storage/reading layer.
All in all, the ideas in the Lakehouse paper are worthwhile, although I find the details in the Delta Lake paper more useful. In other words, if I had to implement a lakehouse, I’d check that paper instead.