Hyperspace: An Indexing Subsystem for Apache Spark

Hello everyone, I hope you're having a wonderful day so far. Today we're going to be talking about Hyperspace, an indexing subsystem for Apache Spark. Before I begin talking about the details of the project, let us introduce ourselves. My name is Rahul Potharaju, and I am joined here by my awesome colleague Terry Kim. Both of us are
from Microsoft. We are part of the product that Microsoft has launched recently called Azure Synapse Analytics. We are part of the Spark team at Microsoft, and it's probably obvious by now, but I'll just say it: we work on everything Spark. We offer Spark as a service to Microsoft customers, which includes both internal customers like Office and Bing and also external customers. Where possible, we contribute
back to Apache Spark and open source the majority of our work. Today I will be covering the background of the project, the vision, and some of the concepts that help you understand what's happening inside this project and why it is something that you have to care about. My colleague here will showcase some real-world examples to hopefully convince you by the end of this talk that Hyperspace is an awesome tech.

So before
we dive into anything technical, let's just start with the most foundational question, which is: what is an index? I can give you the most obvious answer from the textbook; in fact, I just lifted this out of a textbook on databases. An index is a data structure that improves the speed of data retrieval (in other words, query acceleration) at the cost of additional writes and storage space. But if you are anything like me, then a real-world analogy would be tremendously
useful in just understanding what an index really is, so let's take another crack at it. Ever since we were kids, you might remember going to the back of your textbook, trying to figure out where exactly a particular key phrase appears in the textbook, and at that point in time you might have come across an index at the back of your textbook. For instance, if I wanted to find where in the textbook the phrase "nested loop join" appears, I would quickly go
to the section which has every word starting with the letter N, and I would look up "nested loop join", and I would immediately see that it appears between pages 718 and 722 in the textbook. And that's an index for you. In short, one can imagine an index as a shortcut to accelerating some of the queries.

So let's begin with the overview of Hyperspace. We have some pretty broad
goals in Hyperspace. To begin with, our first and primary goal here is to be agnostic to data format. Our second goal is to offer a path towards low-cost index metadata management. What do we mean by this? Well, we want to make sure that all of the index contents, as well as the associated metadata, are stored on the lake, and that this does not assume any other service to operate correctly. Our third goal is to enable multi-engine interop. Fourth, we want to enable an extensible indexing infrastructure; in other words, what you see here today is more of a beginning than an end. And finally, we want to make sure that we satisfy all of the security, privacy, and compliance requirements.

At the bottommost layer, as you can see here, the only assumption is that there is a data lake, and on the data lake there are data sets that you already have or potentially structured