A lot of people here are working with Spark, right? Raise your hands. Yeah, it's nice. So, I will start with what pandas is, but if you are working with Python you probably already know that one. Then, what is Arrow; this may be new to you, or maybe it's not. Then a bit of how Spark works; I mean, we are at Spark Summit, so you probably all know this, but it's always good to be reminded of how the pieces fit together.
Then I will explain how PySpark works, because maybe you don't know that one, and it is useful. And finally I'll cover how Arrow is making PySpark way, way faster.

So, I'm Ruben, a mathematician. I work as a consultant for an advertising company, and I think I've been writing Python for something like 15 years; I mean, NumPy didn't exist back then. Then I moved to other languages, but I still write a lot of Python day-to-day; Scala and Python
now. So the PySpark code base is kind of a sweet spot for me, because it has a bit of Python, a bit of Java and Scala; it's a cool place to be.

So, what is pandas? I mean, you may know it already: it's a Python data analysis library. If you see a job offer that says Python and data, pandas is involved in the code base somewhere. And it's efficient. Why is it efficient? Because it's columnar and it has a C and NumPy back end, so
the machinery is not written in Python; I mean, the high-level API is, but the rest is a very efficient implementation. Columnar appears a lot in modern databases and modern storage systems.
And what's the point of something being columnar? Imagine we have a table with a lot of columns and we want to sum one of the columns. If you remember the architecture, you have a CPU, you have RAM, and you have a cache line inside the CPU. If you're thinking in rows and you want to sum one column, you need to go row by row, loading the data a little bit at a time, to the point where you can actually sum it. When something is stored column-wise, you don't need to do it slowly: you can just get everything from the column, bring it into the cache, and sum. That was a bit fast; if you were in the previous talk, the speaker, whose name is really hard to pronounce, he
was doing the integration of SparkR with Arrow, and he also explained how SIMD instructions make this fast.
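To make the cache argument concrete, here is a minimal sketch in NumPy (my example, not from the talk): the same data in row-major and column-major order, where summing one column is strided access in the first layout and contiguous access in the second.

```python
import numpy as np

n_rows, n_cols = 1_000_000, 10

# Row-wise layout: each row is stored contiguously, so summing one
# column means striding through memory, pulling in a full cache line
# per element just to use a single value from it.
rows = np.random.rand(n_rows, n_cols)   # C order: row-major
col_sum_rowwise = rows[:, 3].sum()      # strided access

# Column-wise layout: the column is one contiguous buffer, so the CPU
# can stream it through the cache and apply SIMD instructions to it.
cols = np.asfortranarray(rows)          # F order: column-major
col_sum_colwise = cols[:, 3].sum()      # contiguous access
```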
So pandas is columnar. How does it work internally, at a high level? I don't want to get into the details, because they are kind of complicated. You have a DataFrame; if you use pandas, you know what it is: it's basically a table with some names for the columns, and the columns have different types. Internally, pandas is going to group these columns in memory by type: the index goes off to one place, then you bring together all the floats, put them sideways, and lay them out contiguously in memory; the same for strings (objects: anything that is not an internal data type for pandas is stored as an object block) and for integers. There are several pandas data types, and they are stored differently in memory. Altogether, the blocks are handled by a BlockManager, and when you request something to be done to a column, you tell the BlockManager: give me this column and run this operation.
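You can peek at those blocks yourself; a small sketch, assuming a recent pandas where the private attribute is called `_mgr` (older versions call it `_data`). It's internal, so this is for poking around, not for production code:

```python
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2, 3],           # int64
    "x": [1.0, 2.0, 3.0],     # float64
    "y": [4.0, 5.0, 6.0],     # float64, consolidated with "x"
    "s": ["a", "b", "c"],     # object (strings)
})

# One block per dtype: the two float columns share a single 2-D array.
for block in df._mgr.blocks:
    print(block.dtype, block.shape)
```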
Now, quickly, on to Arrow. What is Arrow? Arrow is a library, so you are unlikely to be using it directly unless you are writing your own library; I mean, you are just a client, through pandas or through Spark, of the Arrow library. It's cross-language, it's columnar, it's in-memory. The key point is that it's cross-language, so you can use the same internal structure in different languages and pass data from one to the other. And it's optimized; I mean, that's the whole point of it. It is written in C++, and there are implementations in Rust, Go, C++, Java, and obviously R. There are not a lot of projects that are using it internally, but pandas is one of
them.
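As an illustration of the cross-language point, here is a minimal sketch (my example; the file name and data are made up) that writes a table in the Arrow IPC file format from Python; any other Arrow implementation, say in Java, Rust, or Go, can then map those same columnar buffers without re-parsing anything:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

# Write the table in the Arrow IPC file format.
with pa.OSFile("scores.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Any Arrow implementation (Java, Rust, Go, R, ...) can now memory-map
# "scores.arrow" and see the exact same columnar layout.
readback = pa.ipc.open_file(pa.memory_map("scores.arrow")).read_all()
```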
Who is using Arrow? Spark, of course, uses it; that's why I'm here. And there are more: Dask, which is a distributed computation framework kind of similar to Spark but written entirely in Python; Ray, which is basically something similar to Dask; there is a POC of data processing written in Rust using only Arrow; and there's a connector from Java and Scala to Python that is still experimental and is using Arrow as
well. And the integration between Arrow and pandas is seamless; this is by design. I mean, Arrow was created by the same developers that started pandas, so the internal memory layouts are similar enough that the operations are really
fast.
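As a sketch of how seamless that is (my own minimal example, with made-up column names), the round trip with pyarrow is one call in each direction:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user": ["a", "b"], "clicks": [10, 20]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow
df_again = table.to_pandas()       # Arrow -> pandas
```

For many column types this conversion can be close to zero-copy, precisely because the memory layouts line up.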
What does it look like in Arrow? Again, you have a table; this time I didn't bother with the types. Each set of rows is called a record batch; I mean, you can always call a row a record, since in database lingo rows are also records. You bring all the columns, put them sideways, and that is the data; consider that a record batch. That's done for a certain number of rows, and it can be done in a streaming fashion: basically, the metadata says up to this point is this column, up to this point is
that other column, and you can essentially skip the first one if you only want the last one. So you can do it in streaming, just getting blocks of rows one after the other. At a high level, you can think of an Arrow table as a set of record batches. It's not exactly that internally; internally it's more complicated than this, but from a high level it is good enough to understand what Arrow looks like.
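Here is a minimal pyarrow sketch of that mental model (the column names are made up): record batches built column by column, and a table viewed as a sequence of batches.

```python
import pyarrow as pa

# One record batch: a slice of rows, stored column by column.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# At a high level, a table is a set of record batches, so it can be
# produced and consumed in a streaming fashion, batch by batch.
table = pa.Table.from_batches([batch, batch])
print(table.num_rows)            # 6
for b in table.to_batches():
    print(b.num_rows)            # 3, 3
```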