Internals of Speeding up PySpark with Arrow - Ruben Berenguel (Consultant)

Created with glancer

A lot of people here are working with Spark, right? Raise your hands — yeah, nice. So I will start with what pandas is, although if you are working with Python you probably already know that one. Then, what Arrow is — this may be new to you, or maybe it's not. Then a bit of how Spark works; I mean, we are at a Spark summit, so probably you all know this, but it's always good to remind ourselves how the pieces fit together, and it's useful when I explain how PySpark works, because maybe you don't know that one. And then, finally, I'll cover how Arrow is making PySpark way, way faster.

I'm Ruben, a mathematician. I work as a consultant for an advertising company, and I have been writing Python for something like 15 years — I mean, NumPy didn't even exist back then. Then I moved to other languages, but I still write a lot of Python day to day; Scala and Python now.
So the PySpark code base is kind of a sweet spot, because it has a bit of Python, a bit of Java, a bit of Scala — it's a cool place to be.

So, what is pandas? You may know it already: it's a Python data analysis library. If you see a job offer that says Python and data, pandas is involved in the code base somewhere. And it's efficient. Why is it efficient? Because it's columnar, and because it has a C-backed core — the high-level API is Python, but the rest is a very efficient implementation. Columnar layouts appear a lot in modern databases and modern storage systems. And what's the point of something being columnar? Say we have a table with a lot of columns and we want to sum one of the columns. In a schematic view of the machine you have a CPU, you have RAM, and you have a cache line inside the CPU. If the data is stored in rows and you want to sum one column, you need to go row by row, loading the data a little bit at a time until you can actually sum it. When something is stored column-wise you don't need to do it slowly: you can just fetch everything from that column, bring it into the cache, and sum it — that is a lot faster. If you were in the previous talk — Hyukjin, that's really hard to pronounce — he was doing the integration of SparkR with Arrow, and he also explained how SIMD instructions make this fast.
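As a toy illustration of that difference (this snippet is mine, not from the talk): summing a column that is stored contiguously is much friendlier to the CPU cache than picking the same field out of a million row objects.

```python
import numpy as np

# Row-oriented: a list of "rows", each row holding every column.
# Summing one column means touching every row object, one by one.
rows = [{"a": i, "b": float(i), "c": str(i)} for i in range(1_000_000)]
row_sum = sum(r["a"] for r in rows)

# Column-oriented: the same column stored as one contiguous buffer.
# The CPU streams it through the cache (and NumPy can use SIMD).
col_a = np.arange(1_000_000)
col_sum = int(col_a.sum())

assert row_sum == col_sum
```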
So pandas is columnar. How does it work internally, at a high level? I don't want to get into the details, because they are kind of complicated. You have a DataFrame — if you use pandas you know what it is: it's basically a table with some names for the columns — and you have different types. Internally, pandas is going to group these types in memory by column: the integers go to one place, then you bring together all the floats, put them sideways, and fit them in memory that way. Same for strings — objects; anything that is not an internal data type for pandas is stored as an object block. And there are several pandas data types that are stored differently in memory altogether. The blocks are handled by a block manager, and when you request something to be done to a column, you tell the block manager: give me this column and run this operation.
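To make that concrete, here is a small peek at those blocks. Note that `_mgr` (the BlockManager) is private pandas API and its details change between versions, so treat this as an illustration only, not something from the talk.

```python
import pandas as pd

df = pd.DataFrame({
    "ints": [1, 2, 3],
    "more_ints": [4, 5, 6],
    "floats": [0.1, 0.2, 0.3],
    "strings": ["a", "b", "c"],
})

# The block manager groups same-typed columns into 2-D blocks:
# the two int64 columns share one block, the float column gets another,
# and the string column lands in an object block.
for block in df._mgr.blocks:  # private API; may differ across pandas versions
    print(block.dtype, block.shape)
```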
Now let's move quickly to Arrow. What is Arrow? Arrow is a library, so you are unlikely to be using it directly unless you are writing your own library — as a user you are just a client of it, through pandas, through Spark, or through another library. It's cross-language, it's columnar, it's in-memory. The key point is that it's cross-language, so you can use the same internal structure in different languages and pass data from one to the other. And it's optimized — that's the whole point of it. The core is written in C++, and there are implementations in Rust, Go, C, Java and, obviously, R. There are not a lot of projects using it internally yet, but pandas is one of them. Spark, as we are going to see, uses it for something. There's Dask, which is a distributed computation framework kind of similar to Spark but written entirely in Python; Ray, which is basically something similar to Dask; a proof of concept of data processing written in Rust using only Arrow; and a connector from Java/Scala to Python that is still experimental and is using Arrow as well.
The integration between Arrow and pandas is seamless, and this is by design — Arrow was created by the same developers that started pandas, so the internal memory layouts are similar enough that the operations between them are really fast.
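A minimal sketch of that round trip using pyarrow (my example, not from the slides):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

# pandas -> Arrow: build an Arrow Table from the DataFrame's columns.
table = pa.Table.from_pandas(df)
print(table.schema)

# Arrow -> pandas: convert back; for compatible types this is cheap
# because the columnar layouts line up closely.
round_tripped = table.to_pandas()
print(round_tripped.equals(df))
```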
How does it look in Arrow? Again you have a table — this time I didn't bother with the types — and each set of rows is called a record batch. (You can always call a row a record; in database lingo rows are also records.) You take all the columns, put them sideways, and that is the data that makes up one record batch, for a certain number of rows. And this can be done in a streaming way: basically the metadata says "up to this point is this column, up to that point is that other column", so you can essentially skip the first one if you only want the last one, and you can stream by just sending blocks of rows one after the other. So at a high level you can think of an Arrow table as a set of record batches. It's not exactly that internally — internally it's more complicated than this — but from a high level it's good enough to understand what Arrow looks like.
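Here is a small, hedged sketch of that idea with pyarrow — building a table from record batches and streaming batches through the IPC stream format (example mine, not from the talk):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict(
    {"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]}
)

# Logically, a Table is a sequence of record batches sharing one schema.
table = pa.Table.from_batches([batch, batch])
print(table.num_rows)  # 6

# The streaming IPC format writes batch after batch, so a reader can start
# consuming rows before the producer has finished the whole table.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
    writer.write_batch(batch)

reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.num_rows)  # 3, then 3
```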