Hello everyone, I hope you're having a
wonderful day so far. Today we're going
to be talking about Hyperspace, an
indexing subsystem for Apache Spark.
Before I begin talking about the details
of the project,
let us introduce ourselves. My name is
Rahul Potharaju, and I am joined here by
my awesome colleague Terry Kim. Both of
us are
from Microsoft. We are part of the
product that Microsoft has launched
recently called Azure Synapse Analytics.
We are part of the Spark team at
Microsoft, and it's probably obvious by
now, but I'll just say it: we work on
everything Spark. We offer Spark as a
service to Microsoft customers, which
includes both internal customers like
Office and Bing and also external
customers. Where possible, we contribute
back to Apache Spark and open source the
majority of our work. Today I will be
covering the background of the
Hyperspace project, the vision, some of
the concepts that help you understand
what's happening inside this project,
and why it is something that you have to
care about. My colleague here will
showcase some real-world use cases to
hopefully convince you by the end of
this talk that Hyperspace is an awesome
tech. So before
we dive into anything technical, let's
just start with the most foundational
question, which is: what is an index? I
can give you the most obvious answer
from the textbook. In fact, I just
lifted this out of a textbook: in
databases, an index is a data structure
that improves the speed of data
retrieval, which in other words is query
acceleration, at the cost of additional
writes and storage space. But if you are
anything like me, then a real-world
analogy would be tremendously
useful in just understanding what an
index really is. So let's take another
crack at it. Ever since we were kids,
you might remember going to the back of
your textbook, trying to figure out
where exactly a particular key phrase
appears in the textbook. At that point
in time you might have come across an
index at the back of your textbook. For
instance, if I wanted to find where in
the textbook the phrase "nested loop
join" appears, I would quickly go to the
section which has every word starting
with the letter N, and I would look up
"nested loop join", and I would
immediately see that it appears between
pages 718 and 722 in the textbook. And
that's an index for you.
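
To make the textbook analogy a bit more concrete, here is a minimal Python sketch. The page contents and the helper names `scan` and `build_index` are made up for illustration; they are not from the talk or from Hyperspace. It only shows the trade-off from the definition: the index costs one extra pass and some extra storage up front, and in exchange a lookup no longer has to read every page.

```python
from collections import defaultdict

# Hypothetical book: page number -> text on that page (illustrative data only).
pages = {
    717: "hash join builds a table on the smaller input",
    718: "nested loop join compares every pair of rows",
    720: "a nested loop join is simple but slow",
    722: "block nested loop join reduces page reads",
    723: "sort merge join needs sorted inputs",
}


def scan(pages, phrase):
    """Without an index: read every page of the book (a full scan)."""
    return sorted(page for page, text in pages.items() if phrase in text)


def build_index(pages, phrases):
    """With an index: pay once (extra write work and extra storage)
    to record which pages mention each key phrase."""
    index = defaultdict(list)
    for page, text in sorted(pages.items()):
        for phrase in phrases:
            if phrase in text:
                index[phrase].append(page)
    return dict(index)


index = build_index(pages, ["nested loop join", "hash join"])

# Lookup is now a single dictionary access instead of a scan of all pages.
print(index["nested loop join"])  # [718, 720, 722]
```

Both `scan(pages, "nested loop join")` and `index["nested loop join"]` return the same pages; the difference is that the scan touches every page on every query, while the index has to be kept up to date whenever the book changes.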
In short, one can imagine an index as a
shortcut to accelerating some of the
queries. So let's begin with the
overview of Hyperspace. We have some
pretty broad goals in Hyperspace to
begin with. Our first and primary goal
here is to be agnostic to data format.
Our second goal is to offer a path
towards low-cost index metadata
management. What do we mean by this?
Well, we want to make sure that all of
the index contents as well as the
associated metadata are stored on the
lake, and this does not assume any other
service to operate correctly. Our
third goal is to enable multi-engine
interop. Fourth, we want to enable an
extensible indexing infrastructure. In
other words, what you see here today is
more of a beginning than an end. And
finally, we want to make sure that we
satisfy all of the security, privacy,
and compliance requirements. At the
bottommost layer, as you can see here,
the only assumption is that there is a
data lake, and on the data lake there
are datasets that you already have, or
potentially structured