The Presto paper
This is the next installment on my quest to read and help understand interesting papers in the data space.
This one turned out shorter than usual for these posts because I’m still in “COVID brain” mode. If you read anything weird, please ping me on Twitter!
This is a summary of Presto: SQL on everything, the initial design paper of Presto from Facebook. Presto is in relatively widespread use (Meta, Netflix…). Starburst was a spin-off from Teradata that initially offered commercial support for Presto; over time, the four authors of the paper joined it. Somewhat later the Presto Software Foundation, maintaining the PrestoSQL fork, was founded… and PrestoSQL was eventually renamed to Trino, since the Presto trademark belonged to Facebook (which had by then transferred it to the Linux Foundation). The history of Presto seems more complex than its design.
Presto is an adaptive, multi-tenant system capable of concurrently running hundreds of intensive queries, and it can scale to thousands of worker nodes. Efficiently. Its design is federated, so with one query you can request data from many different sources (wink Data Mesh, wink). It is designed for high performance, using partitioning in smart ways and offering optimised query code generation.
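To make the federation bit concrete, here is a minimal sketch of what a cross-source query looks like from a client, using the trino Python client (the prestodb package is analogous). The host, catalogs and table names are all made up:

```python
# Illustrative only: host, catalogs and tables are invented.
from trino.dbapi import connect  # pip install trino

conn = connect(host="presto.example.internal", port=8080, user="analyst")
cur = conn.cursor()

# One query, two sources: a Hive table joined against a MySQL table.
# Presto resolves each catalog prefix through the matching connector.
cur.execute("""
    SELECT u.country, count(*) AS clicks
    FROM hive.web.click_events e
    JOIN mysql.crm.users u ON e.user_id = u.id
    GROUP BY u.country
""")
print(cur.fetchall())
```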
It is a tool that covers several use cases: interactive small-scale analytics, batch ETL (or ELT), and fast analytics on larger-scale datasets, like those needed for A/B tests or advertiser analytics.
Architecture
A Presto cluster has a coordinator node and several worker nodes. The coordinator receives, parses, plans and optimizes queries. It also orchestrates queries across the worker nodes, which do the actual processing. The coordinator distributes the plan to the workers and triggers the execution of tasks. Together with the tasks, it enumerates splits, which are handles to (I assume immutable) data chunks that tasks need. If you have read any previous paper in this series, this will sound familiar. Workers then use cooperative multi-tasking to handle many tasks concurrently.
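As a toy illustration of this division of labour (not actual Presto code; every name here is invented): the coordinator enumerates splits as handles into the underlying data and hands them to tasks running on workers.

```python
# Toy sketch, not actual Presto code: all names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)  # a split is a handle to a (presumably immutable) chunk of data
class Split:
    path: str
    start: int
    length: int

def enumerate_splits(table_files: dict[str, int], chunk: int = 64 * 2**20):
    """Coordinator side: turn a table's files into split handles."""
    for path, size in table_files.items():
        for start in range(0, size, chunk):
            yield Split(path, start, min(chunk, size - start))

def run_task(task_id: str, splits: list[Split]) -> None:
    """Worker side: a real worker multiplexes many tasks cooperatively;
    here we just pretend to scan each split."""
    for s in splits:
        print(f"{task_id}: scanning {s.path} [{s.start}, {s.start + s.length})")

run_task("task-0", list(enumerate_splits({"part-0.orc": 200 * 2**20})))
```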
Presto has an extension API, referred to in the paper as the Connector API (currently part of the SPI plugin interfaces), which has 4 parts: metadata, data location, data source and data sink. This is what makes realising an easy-to-query data mesh viable.
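A conceptual sketch of what those four parts cover, in Python for brevity; the real SPI is Java, and these method names are my invention, not Presto’s actual interfaces:

```python
# Conceptual sketch only; the real SPI is Java and these
# method names are invented, not Presto's actual interfaces.
from abc import ABC, abstractmethod

class Connector(ABC):
    @abstractmethod
    def list_tables(self):        # metadata: schemas, tables, column types
        ...

    @abstractmethod
    def get_splits(self, table):  # location: where the data lives,
        ...                       # including any indexes available

    @abstractmethod
    def page_source(self, split): # source: read the data behind one split
        ...

    @abstractmethod
    def page_sink(self, table):   # sink: write query results back out
        ...
```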
Presto uses what is mostly a standard, ANSI SQL dialect, parsed with ANTLR into a query tree which is logically planned and then physically planned. The query optimizer starts its work on logical plans by applying transformation rules, such as constant folding. The paper mentions a plan to implement a cost-based optimizer (like the one in the Cascades paper, the same used in Snowflake), which has been at least partially implemented in current Presto. The optimiser also uses data locality to create more efficient physical plans (connectors expose it via the location API, down to specifying what indexes may be available), preventing unneeded shuffles and enabling colocated joins. Each connector can also offer predicate pushdown, likely defined in the source API. The paper doesn’t get into details about this.
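To make “transformation rule” concrete, here is a tiny constant-folding pass over a made-up expression tree; real optimizer rules work on relational plan nodes, but the shape is the same:

```python
# A tiny constant-folding rule over an invented expression tree.
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def fold_constants(node):
    """Bottom-up rewrite: Add(Const, Const) -> Const."""
    if isinstance(node, Add):
        left, right = fold_constants(node.left), fold_constants(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return node

# e.g. WHERE x > 1 + 2 becomes WHERE x > 3
print(fold_constants(Add(Const(1), Const(2))))  # Const(value=3)
```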
Physical planning produces nodes annotated with preferred properties, like particular requirements for shuffling, bucketing, sorting… These are taken into account when shuffling and distributing the plan across workers, and Presto will greedily go for a partitioning that satisfies as many of them as possible (see the sketch below). Execution within a node can also be optimised for parallelism. These optimisations might not look as fancy as others, but for real-time analytics many types of queries benefit greatly from them.
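A toy version of that greedy choice, with entirely invented property names: count, for each candidate partitioning, how many nodes’ preferences it fully satisfies, and pick the best:

```python
# Toy greedy choice with invented property names.
preferences = [
    {"hash(user_id)", "sorted(ts)"},  # what each plan node would like
    {"hash(user_id)"},
    {"bucketed(region)"},
]
candidates = [
    {"hash(user_id)", "sorted(ts)"},
    {"bucketed(region)"},
]

def satisfied(candidate: set) -> int:
    """How many nodes' preferred properties this partitioning fully meets."""
    return sum(pref <= candidate for pref in preferences)

best = max(candidates, key=satisfied)
print(sorted(best), satisfied(best))  # ['hash(user_id)', 'sorted(ts)'] 2
```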
Query engine
Query plans sent to workers are split into stages (see the RDD post), each stage composed of tasks, with stages separated by shuffles. Each task, though, might be broken into several pipelines (which can additionally be parallelised within a node), which are sequences of operators. An operator is a single, simple transformation of the data, and Presto includes a code generator that tries to leverage JIT optimisations in the Java Virtual Machine.

Scheduling goes from stages (which stages run?) to tasks/splits (how many, on which nodes?). Stages, which can be leaves or intermediate, can either be scheduled all at once, minimising wall clock time, or phased. Phased execution finds strongly connected components of the (directed) query graph and starts them in topological sort order to prevent deadlocks, which is a fancy way of saying “across dependencies in a dependency graph” (now available built-in in Python, see the sketch below).

Once tasks start running on leaf stages, the coordinator assigns small batches of splits to them, requesting them from the connectors. This way of processing data improves throughput by making sure there is always data available to be computed, in a quasi-push model. In case it was not clear, Presto uses a push model, not a pull (Volcano-like) model. If you have a good memory or have been paying attention: Snowflake also uses a push model.
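The Python built-in that aside refers to is graphlib (3.9+); here is a miniature version of phased stage scheduling with invented stage names:

```python
# Miniature phased scheduling; stage names and dependencies are invented.
from graphlib import TopologicalSorter  # built into Python 3.9+

# stage -> the stages feeding it (i.e. its upstream shuffles)
query_graph = {
    "join": ["scan_users", "scan_events"],
    "aggregate": ["join"],
    "output": ["aggregate"],
}

order = list(TopologicalSorter(query_graph).static_order())
print(order)  # one valid order: the scans first, then join, aggregate, output
```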
The paper goes into additional details on resource management, but I didn’t find them that interesting. It also mentions that Presto can work directly with compressed file formats, but that is not unexpected nowadays.