Does Snowflake have a technical moat worth 60 billion?
I didn’t know much about Snowflake, so I decided to have a look at its SIGMOD (ACM Special Interest Group on Management of Data) paper and investigate a bit more what special capabilities they offer, and how they compare to others.
Snowflake (New York Stock Exchange: SNOW) has recently gone public, in an IPO (Initial Public Offering) that valued the company at around 70 billion. It caught me by surprise. I knew Snowflake was well regarded, but this valuation looked crazy. Hence, if you are curious about why Snowflake may (or may not) be worth this much, what its technological moat is and these kinds of questions, read on.
This write-up is divided into the following sections, loosely following the SIGMOD paper, with the additions of IPO information and my final commentary.
- Cloud ready
- The Query Engine
- The Query Optimiser
- The Virtual Warehouse
- Cloud Services Layer
- IPO details
- So?
Cloud ready
The first differentiating factor Snowflake offers is being built cloud-ready. I was somewhat confused by this: weren’t MapReduce, Hadoop, etc. built for large datacenters? Yes, but that does not mean cloud-ready. What public clouds offer is (technically)
- infinite compute scalability
- infinite storage
Snowflake was designed to take advantage of these two properties. It does so by completely decoupling compute and storage. Although Hadoop’s HDFS (Hadoop Distributed File System) also separates compute and storage, it does not really do so from a “cloud” point of view: managed Hadoop HDFS clusters are still compute resources. In Snowflake’s case, storage is handled by the cloud’s own blob/object storage (S3 for AWS). Relying on cheap object storage also allows for several improvements, like easy time travel (you can just keep more copies of the data at different points in time).
Traditional database systems (arguably Snowflake should be considered a distributed compute engine, but they compare themselves to database systems) are in general coupled: the same instances that hold the data also execute the computation. Of course, this makes sense in most cases: the closer the data is to the compute engine, the less network transfer you need and the more locality optimisations you can pull off.
Having fully elastic compute means that the system can easily adapt to varying workloads. You don’t need to have idle instances while your team is not analysing data, and when Black Friday arrives you can have as much compute as you need (or can pay for). This is very efficient, even from Snowflake’s point of view: they recommend you fully shut down your nodes when finished.
Most importantly, once you have infinite compute scalability, you can offer a pay-as-you-go model, where storage is a fixed cost and compute expense depends on your use.
Note that to make cloud-ready a reality, Snowflake had to focus exclusively on being a SaaS (Software as a Service) offering and discard the lucrative on-premises market. This has implications: for some industries (banking would be the most obvious, medical may be another) having your private cloud in the basement is a strong requirement.
Offering a full SaaS also means that engineering teams no longer need to manually tweak and adapt settings (what was `executorMemoryOverhead` again?), since this is now deferred to the Snowflake engineers.
The second differentiating factor for Snowflake is being able to handle semi-structured data. By semi-structured data, I mean (and I assume the paper authors do, too) having JSON objects in what would be a column of your data/table. I will address how this works and its implications when I delve into the technicalities of the paper.
All these look like strong points, but I already have a question: what is the technical advantage? What is their moat? After all, separating compute and storage is already offered by Google BigQuery and Amazon Athena (or Redshift Spectrum, or the new Redshift compute-only instances). Databricks with Delta Lake has transient clusters so you can pretty much scale compute infinitely as well, while keeping storage in object storage. Is there any difference?
Well, the answer is a partial yes, in that the technical details are interesting enough in their own right. But they don’t seem to offer a large moat, since many of these competitors could basically upend them given enough time.
The Query Engine
Snowflake is based on a custom-built query engine: not PostgreSQL-based, not Hadoop-based. According to their SIGMOD paper, everything is written from scratch (although there are no details about the language or anything else, a look at their job openings hints at Java). It is also a shared-nothing system, which avoids problematic locks (and even more problematic synchronisation problems) across compute nodes.
Tables are horizontally partitioned in a compressed, columnar representation. That is, tables are sliced into chunks of N rows, and these chunks are then stored column-wise in your object storage as immutable files. A bit like how Arrow or Pandas would handle blocks, I imagine. Internal pipelining in the query engine works in batches of rows.
One interesting point is that the block files have enough metadata in their headers to allow random access within the file: this leverages the underlying cloud storage’s random access to avoid reading more data than needed. Less data to read means fewer accesses to slow storage and less network transfer. Win-win-win.
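To make this concrete, here is a minimal sketch of the idea (my own toy format, not Snowflake’s actual layout): a block file whose header records per-column offsets, so a reader can grab a single column with a ranged read instead of downloading the whole file.

```python
import io
import json
import struct

def write_block(columns: dict) -> bytes:
    """Serialise a toy block: a header with per-column offsets, then the column blobs."""
    header = {}
    body = io.BytesIO()
    for name, blob in columns.items():
        header[name] = {"offset": body.tell(), "length": len(blob)}
        body.write(blob)
    header_bytes = json.dumps(header).encode()
    # 4-byte header length prefix, then the header, then the column data.
    return struct.pack("<I", len(header_bytes)) + header_bytes + body.getvalue()

def read_column(block: bytes, name: str) -> bytes:
    """Read a single column without touching the rest of the file.
    Against object storage this slice would be a ranged GET instead."""
    (header_len,) = struct.unpack_from("<I", block, 0)
    header = json.loads(block[4 : 4 + header_len])
    meta = header[name]
    start = 4 + header_len + meta["offset"]
    return block[start : start + meta["length"]]

block = write_block({"user_id": b"\x01\x02\x03", "amount": b"\x0a\x0b"})
print(read_column(block, "amount"))  # b'\n\x0b'
```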
It is fully ACID (Atomicity, Consistency, Isolation and Durability) compliant, using snapshot isolation: transactions see a consistent database due to the underlying immutability of the files. This is akin to what Delta Lake offers on Databricks (although Delta can have some ACID issues if my memory serves). Since the storage layer contains immutable files, failed (partial or full) computations can be re-run easily.
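A rough mental model of how immutable files give you snapshot isolation (again, my sketch, not Snowflake’s implementation): every commit publishes a new list of files as a new version, readers keep using whatever version they started with, and time travel falls out of simply keeping old versions around.

```python
import threading

class VersionedTable:
    """Toy snapshot isolation: each commit publishes a new immutable file list.
    Readers keep using the version they started with; files are never mutated."""

    def __init__(self):
        self._versions = {0: ()}          # version -> immutable tuple of file names
        self._latest = 0
        self._lock = threading.Lock()

    def snapshot(self) -> int:
        return self._latest               # readers pin this version

    def files(self, version: int):
        return self._versions[version]

    def commit(self, added, removed=()):
        with self._lock:                  # only the commit itself is serialised
            current = set(self._versions[self._latest])
            new_files = tuple(sorted((current - set(removed)) | set(added)))
            self._latest += 1
            self._versions[self._latest] = new_files
            return self._latest

table = VersionedTable()
table.commit(added=["part-0001", "part-0002"])
reader_version = table.snapshot()          # a long-running query starts here
table.commit(added=["part-0003"], removed=["part-0001"])
print(table.files(reader_version))         # still ('part-0001', 'part-0002')
```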
The engine is also vectorised, which is unsurprising given that it is columnar-based. Why would you write anything columnar from scratch if you don’t intend to use it as well as possible?
One interesting fact from the paper is that the engine is not based on the Volcano iterator model (what most database systems essentially use, and in particular what Apache Spark uses if you ignore the code generation parts) but on a push-based model. There are not many details on how this works, not even a reference. This paper seems to compare Volcano and push engines (as well as provide additional references) and I intend to give it a look at some point. I love knowing how the database sausage is made.
This blog post does a very good job of explaining push and pull database engines, with examples.
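To illustrate the difference, here is a deliberately tiny sketch; real engines push vectors of rows through compiled pipelines rather than Python objects, but the inversion of control flow is the point.

```python
# Pull (Volcano-style): each operator exposes next() and pulls from its child.
class Scan:
    def __init__(self, rows):
        self._it = iter(rows)
    def next(self):
        return next(self._it, None)

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

# Push-based: each operator is handed rows by its producer and forwards them.
class PushFilter:
    def __init__(self, consumer, predicate):
        self.consumer, self.predicate = consumer, predicate
    def push(self, row):
        if self.predicate(row):
            self.consumer.push(row)

class Collect:
    def __init__(self):
        self.rows = []
    def push(self, row):
        self.rows.append(row)

rows = [{"x": 1}, {"x": 5}, {"x": 7}]

# Pull: the root drives execution by asking for rows one at a time.
pull_plan = Filter(Scan(rows), lambda r: r["x"] > 2)
out = []
while (row := pull_plan.next()) is not None:
    out.append(row)

# Push: the scan drives execution by pushing rows down the pipeline.
sink = Collect()
push_plan = PushFilter(sink, lambda r: r["x"] > 2)
for row in rows:
    push_plan.push(row)

print(out == sink.rows)  # True
```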
The engine handles semi-structured data in a very interesting way. First, by using an optimistic type inference approach. That is, if you have a JSON object in a “column” and the field `dat` statistically contains what look like integers, it will extract this column from the object, store it separately with this type, and compress it appropriately. Oh, and it will also store the original alongside, just in case, to prevent data loss (`12345` is a fine integer until you find a record `00321` and it turns out to be a postcode).
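A toy version of what I understand the idea to be (the threshold and the round-trip check are my own invention, used only to illustrate the principle):

```python
def split_column(raw_values, threshold=0.95):
    """Toy optimistic typing for one semi-structured field: if (almost) every
    value round-trips cleanly as an integer, keep a typed column for fast,
    compressed scans; the raw values are kept regardless, so nothing is lost."""
    typed = []
    clean = 0
    for value in raw_values:
        try:
            as_int = int(value)
            typed.append(as_int)
            if str(as_int) == str(value):   # "00321" -> 321 -> "321" loses the zero
                clean += 1
        except (TypeError, ValueError):
            typed.append(None)
    if raw_values and clean / len(raw_values) >= threshold:
        return {"typed": typed, "raw": list(raw_values)}
    return {"typed": None, "raw": list(raw_values)}    # leave it semi-structured

print(split_column(["12345", "67890", "54321"])["typed"])  # [12345, 67890, 54321]
print(split_column(["12345", "67890", "00321"])["typed"])  # None: one value is a postcode
```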
Snowflake added several SQL extensions to access semi-structured data. These allow access to inner records either by using specific SQL-like functions or by using path-access style syntax. They also added time-travel extensions (like `AT time` or `BEFORE time`, or `UNDROP`). Of course, such extensions are a requirement to be able to handle semi-structured data, which is more or less the kind of data we all ingest day in and day out. Worth noting that flattening is done via lateral views, with no mention of particular optimisations; lateral views can result in expensive queries.
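For flavour, this is roughly what the extensions look like from the Python connector. The table and column names (events, payload, items, sku) are hypothetical, the credentials are placeholders, and the time-travel clause uses the `AT(OFFSET => ...)` form from the documentation rather than the shorthand in the paper.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials; events/payload are hypothetical table/column names.
conn = snowflake.connector.connect(account="my_account", user="me", password="...")
cur = conn.cursor()

# Path-style access into a VARIANT column holding JSON, flattened via a lateral view.
cur.execute("""
    SELECT payload:customer.id::string AS customer_id,
           f.value:sku::string         AS sku
    FROM events,
         LATERAL FLATTEN(input => payload:items) f
    WHERE payload:kind::string = 'purchase'
""")

# Time travel: query the table as it was an hour ago.
cur.execute("SELECT COUNT(*) FROM events AT(OFFSET => -3600)")
print(cur.fetchone())
```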
Data inference and data loading generate zone maps for query optimisation. I can’t find a good, canonical reference for what zone maps are, but they are easy to describe quickly. Zone maps are used to prune queries, and are a kind of cheap index, but for all columns. For instance, for one of the files a table is split into, the zone map would have the maximum and minimum values for each column (and some other “interesting” extrema). Then, once you are querying, you can discard whole files if the filter predicates fall outside the extremes in the maps. Note that AWS Redshift uses zone maps internally, and so does Databricks Delta Lake (for the files stored in Delta Lake format in cloud object storage).
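A minimal sketch of the pruning idea (the file names and the min/max-only zone map are simplifications of mine):

```python
from dataclasses import dataclass

@dataclass
class ZoneMap:
    """Per-file min/max for one column. Real systems keep more: null counts,
    distinct estimates, and a map for every column."""
    file: str
    min_value: int
    max_value: int

def prune(zone_maps, low, high):
    """Keep only files whose [min, max] range can intersect the predicate
    low <= col <= high; everything else is skipped without being read."""
    return [z.file for z in zone_maps if z.max_value >= low and z.min_value <= high]

maps = [
    ZoneMap("part-0001", min_value=0,    max_value=999),
    ZoneMap("part-0002", min_value=1000, max_value=4999),
    ZoneMap("part-0003", min_value=5000, max_value=9999),
]
# WHERE col BETWEEN 1200 AND 1300 -> only part-0002 needs to be read.
print(prune(maps, 1200, 1300))  # ['part-0002']
```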
The Query Optimiser
The query optimiser is a Cascades-based optimiser (see this). According to this very succinct presentation from Carnegie Mellon, Cascades is a reimplementation of the Volcano query optimiser that makes it easier to extend and faster. Cost-based optimisation is then plugged into the system to prune the tree of possibilities. The lack of indices (or more precisely, the fact that users cannot tweak indices) means that performance is easy to understand, since a query will basically either be fast or not.
Table statistics for the CBO (cost-based optimiser) are automatically collected on load and update of the tables, which is also when zone maps are computed. There is also some fancy interval arithmetic at play, where zone maps for relatively complex predicates (like `WEEKDAY(dt) IN (6, 7)`) are also precomputed (or can be deduced from the existing map). Once zone maps are in place, filter and cast operations can be pushed down to the zone map scanning phase.
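The interval arithmetic can be sketched too. Assuming the zone map stores min/max dates per file, a predicate like `WEEKDAY(dt) IN (6, 7)` can prune any file whose date range provably contains no weekend; the six-day shortcut below is my own simplification of how such a derivation might work.

```python
from datetime import date, timedelta

def may_contain_weekend(dt_min: date, dt_max: date) -> bool:
    """Derived pruning for a predicate like WEEKDAY(dt) IN (6, 7), using only the
    file's zone map on dt. Any span of 6+ days necessarily covers a weekend;
    shorter spans are checked day by day."""
    if (dt_max - dt_min).days >= 6:
        return True
    day = dt_min
    while day <= dt_max:
        if day.isoweekday() in (6, 7):   # Saturday or Sunday
            return True
        day += timedelta(days=1)
    return False

# A file covering Mon 2020-09-07 .. Fri 2020-09-11 can be skipped entirely.
print(may_contain_weekend(date(2020, 9, 7), date(2020, 9, 11)))   # False
print(may_contain_weekend(date(2020, 9, 7), date(2020, 9, 13)))   # True
```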
Dynamic pruning of the files to process is deferred to the execution phase. For instance, during the build phase of a `HashJoin`, information about the files is collected so the probe phase can avoid loading files it does not need. The engine also tries to use `BloomJoin`s to prevent excessive data transfer.
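A sketch of what build-side pruning buys you (a real engine would ship a Bloom filter or the min/max of the build keys to the probe-side scans, rather than a plain Python set):

```python
def build_phase(build_rows, key):
    """Build the hash table and, as a side effect, collect the join keys
    actually seen on the build side."""
    hash_table = {}
    seen_keys = set()
    for row in build_rows:
        hash_table.setdefault(row[key], []).append(row)
        seen_keys.add(row[key])
    return hash_table, seen_keys

def probe_files_to_load(probe_zone_maps, seen_keys):
    """Skip probe-side files whose [min, max] cannot contain any build key."""
    lo, hi = min(seen_keys), max(seen_keys)
    return [f for f, (fmin, fmax) in probe_zone_maps.items() if fmax >= lo and fmin <= hi]

customers = [{"customer_id": 17, "name": "a"}, {"customer_id": 23, "name": "b"}]
orders_zone_maps = {                      # probe side: per-file min/max of customer_id
    "orders-0001": (1, 9),
    "orders-0002": (10, 40),
    "orders-0003": (41, 90),
}
table, keys = build_phase(customers, key="customer_id")
print(probe_files_to_load(orders_zone_maps, keys))  # ['orders-0002']
```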
Data skew is handled internally via a process called file stealing. Once a compute node has finished scanning its set of files, it will check whether other nodes still have full queues and steal files from them. The stolen files are then read straight from S3, effectively reducing the workload on the straggling node.
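File stealing is easy to sketch with plain queues (the policy here, steal one file at a time from the most loaded peer, is my guess at the simplest thing that works):

```python
import queue

def scan_worker(my_queue, peer_queues, scan):
    """Process the local file queue, then steal from the most loaded peer.
    Stolen files would be read straight from object storage by this node."""
    while True:
        try:
            scan(my_queue.get_nowait())
        except queue.Empty:
            # Local work done: look for the peer with the longest backlog.
            victim = max(peer_queues, key=lambda q: q.qsize(), default=None)
            if victim is None or victim.qsize() == 0:
                return
            try:
                scan(victim.get_nowait())   # "steal" one file and keep going
            except queue.Empty:
                continue

q1, q2 = queue.Queue(), queue.Queue()
for name in ["part-0001", "part-0002", "part-0003", "part-0004"]:
    q2.put(name)                           # all the work landed on node 2 (skew)
scan_worker(q1, [q2], scan=lambda f: print("node-1 scanned", f))
```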
The Virtual Warehouse
This is the name Snowflake gives to compute clusters in the system. There is no direct access to these: users (clients) don’t interact with the machines or even the cluster, they can only issue queries. There is a 1-1 mapping from user queries to a virtual warehouse: when a user queries their database, a virtual warehouse (VW) will be created for the query.
You can only choose the size of VWs abstractly, in T-shirt sizes. This is an advantage, since it means improvements on the underlying hardware in the cloud offerings are transparent to Snowflake users. The instances in the cluster of a VW are called working nodes. Since creation and destruction of VWs has no effect on the data stored (i.e. the database), users are heavily encouraged to adjust the size per query and spin down any resources when there is nothing going on.
Local disk is used heavily for caching (simple LRU, storing full columns when needed), whereas intermediate computations are spilled directly into object storage to allow optimised re-running of failed queries. Snowflake only uses instances with SSD storage, since this cache is expected to be hot most of the time.
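The cache itself is conceptually just an LRU keyed by file and column; a minimal sketch (entry-count capacity instead of the bytes a real cache would track):

```python
from collections import OrderedDict

class ColumnCache:
    """Simple LRU over (file, column) -> bytes, standing in for the local SSD cache."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, load):
        if key in self._entries:
            self._entries.move_to_end(key)       # mark as recently used
            return self._entries[key]
        value = load(key)                         # cache miss: fetch from object storage
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)     # evict the least recently used entry
        return value

cache = ColumnCache(capacity=2)
fetch = lambda key: f"bytes for {key}"            # stand-in for an S3 GET
cache.get(("part-0001", "amount"), fetch)
cache.get(("part-0001", "user_id"), fetch)
cache.get(("part-0002", "amount"), fetch)         # evicts ("part-0001", "amount")
print(list(cache._entries))
```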
Cloud Services Layer
This is the core of the Snowflake service. It is a heavily multi-tenant and high availability set of services. As expected, services are REST under a service-oriented architecture (SOA). These are long-lived services, that serve all users. RBAC (role-based access control), key rotation and many other good security practices are handled at this level as well.
Many things go on in this layer. It is where the optimiser and transaction manager run, and where the data catalogue, file statistics, locks, and transaction logs are stored. According to the SIGMOD paper, all this data is stored in an unnamed scalable key-value store (given the age of the paper, that discards Redis ¯\_(ツ)_/¯).
This layer handles metadata updates and versioning, for internal schema evolution and service changes. When things change, a translation layer converts from one version to others, and the changes are “blue-greened” onto users. The underlying virtual warehouse instances only talk to the corresponding version of cloud services, so the update is easy to propagate. Queries running on old nodes are left to finish and eventually all users are on the new system.
I missed this reference from Snowflake themselves about how FoundationDB is used behind the scenes in the metadata layer.
IPO details
This is a small summary of the IPO prospectus. Note that their estimate of the addressable market is a whopping $80 billion.
Snowflake has 3k customers and is running more than 500 million queries per day. Of these 3k customers, 50 spend more than $1 million per year. They have some pretty large customers according to the prospectus, like Capital One (a bank) and Sainsbury’s (funnily, not mentioned by name in the prospectus, but I dare you to find another big retail chain in Europe selling clothes, food and financial services).
Snowflake was heavily tied to AWS in the beginning, but eventually (2018 for Azure and 2020 for GCP) became fully multi-cloud. They handle it by defining a Snowflake region as a 1-1 mapping to a public cloud region. This also allows users to have multi-cloud Snowflake set-ups, which is probably useful for somebody.
But even if it is supposedly multi-cloud, most of its clients are on AWS. So much so that they have around $5 billion in AWS compute reserved for the coming years.
Their revenue is slowly increasing, and I find it quite interesting that their services revenue (in-house experts billed at steep rates to set things up for clients) has significantly increased. They have also increased their research and development spending. Even with the revenue increase, they are still not profitable.
So?
We could already answer no given the title of this post.
Why is Snowflake (possibly) superior to other similar offerings? On the surface, Databricks Delta Lake, AWS Redshift (or Redshift Spectrum/Athena) and Google BigQuery are all relatively close:
- You fully choose the cluster size in Databricks, and the underlying storage setup is similar to Snowflake’s
- The new compute-only nodes for Redshift fully decouple compute and storage. Regular Redshift also uses something similar to zone maps internally.
- BigQuery is fully decoupled as well.
What are the differences? The main one is in the level of granularity.
You can set up your clusters in Databricks as you choose, but you need a cluster to issue queries, at least for now. You can’t just query your data: you connect to a running cluster that has access to this data. This may look like a small thing, but in terms of usability it is a huge difference. The granularity in Databricks is at the cluster level, whereas in Snowflake it is at the query level.
Additionally, the description of Snowflake’s metadata seems more advanced than what Delta Lake offers, where you depend on what Parquet provides. Parquet offers random file access, but I can’t comment on how good the interaction between Spark and this is. Since Databricks Spark is significantly faster than AWS EMR Spark (around 2x for the use case I had when I measured), this may be due to random file access being used. Also, Databricks leans really hard on local caching. This seems to show that Snowflake has an edge over Databricks in terms of cost and usability.
Redshift has similar internal semantics (zone maps, plus z-indexes as well) but lacks build-side optimisations for pruning in probes, or at least I don’t remember reading about these. And that is a huge improvement! Distributed hash joins are the most common operation once you have reasonably sized data, and speeding them up compounds quickly.
The new compute-only nodes in Redshift are expensive. As in really costly. The whole point of Snowflake is running small when you do easy things/small data. Compute-only Redshift does not offer that: for small-to-medium volumes of data they directly force you to use Spectrum or Athena (the smallest compute-only cluster is $5k/month for 160 TB of accessible S3 data). This is pretty steep, and it is not helped by the pricing models of Spectrum or Athena, which charge per volume of data scanned. If your queries are badly written, you are toast, because there is no cap on this cost; in Snowflake you cap it (per hour) by the size of your virtual warehouse. This looks like a win for Snowflake.
I’m not very familiar with any new offerings, but traditionally BigQuery suffers from a similar flaw to Redshift Spectrum: it charges per computation, basically auto-spinning machines. If your query is badly written, you are going to (literally) pay for your mistake. This cost is mostly uncapped (I think there are new options now that may help with this), whereas in Snowflake it is capped.
Against these three possible competitors, Snowflake seems to have a clear edge. Worth 60 billion? I wouldn’t count on that for a long-term investment. Of course, I’m not an investor: Berkshire Hathaway (Warren Buffett’s investment company) has invested $700 million, and they are famous for being long-term investors.
One kicker for me is that they have been called the new Oracle. Where is Oracle now? They are still a large business, and they are pretty stable. In general, database systems are sticky: once you choose a vendor, you keep it for a pretty long time. That is good, but traditionally every database system architecture has had a relatively short timespan in which to profit.
There doesn’t seem to be any particular technology moat. The user experience and granularity seem to be excellent, but how hard would it be for any of the other parties to improve their offering to a similar point? It may not be a matter of a few months, but I wouldn’t put my money on Snowflake for long.
Although I didn’t mention streaming (since it’s not covered in the paper), I initially had streaming listed as a point against Snowflake. This turns out to be a mistake: Snowflake is working on a streaming solution (also see Tyler Akidau’s slides).