Data Microservices in Apache Spark using Apache Arrow Flight


hi everyone, thanks for coming to the webinar. My name is Ryan Murray, I've been at Dremio for about a year, working there as a developer doing some open source projects and some research and that kind of stuff. So today I wanted to share with you one of the cooler things that we've been doing lately, which is our work on Apache Arrow Flight, and in particular for this talk
some of the stuff with the Spark connector and how that can apply to building data microservices. For those of you who aren't aware, I just thought I'd level set on what Apache Arrow is and get everyone on the same page before we go any further. Arrow has really become sort of the industry standard for in-memory data. I'm not sure how many people have heard about it, but as you can see it's in tons of different applications: it's used in
Spark and in Dremio, and NVIDIA is doing some interesting stuff with it on GPUs. It's kind of spreading all over the place, and that's really by design. One of the goals that we had when we first formed Arrow as a community was to make it a lingua franca for data, the idea being that if you store data in Arrow format, there's all kinds of tools which you can leverage to
do calculations on that data, and there are understood, clear, and standardized ways to move data between processes and between applications and machines. Since the first release of Arrow back in December 2016, you can see downloads have been growing exponentially; every month there are even more downloads. Part of that is the broad language support: there are
close to half a dozen languages now that it's implemented in. Some of these sort of wrap the C++ libraries and others are native implementations, and that really helps with the lingua franca part of it; every programming language is sort of speaking the same language when it comes to data. And as I said, the community's very active, there's like over 300 developers, and we're doing a lot of interesting stuff making Arrow
work on CPUs, GPUs, and more recently even FPGAs. So what is Arrow? Well, simply, it's an in-memory specification for data: it tells you how to lay out your data in memory in a binary format that makes it extremely efficient for large analytical workloads, and that's irrespective of whether you're on CPUs, GPUs, or more exotic things.
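To make that layout concrete, here's a minimal sketch of my own (not from the talk) using pyarrow, Arrow's Python library, showing data held as typed columnar arrays in memory:

```python
# Minimal sketch of Arrow's columnar in-memory format using pyarrow.
# (Illustrative example; assumes `pip install pyarrow`.)
import pyarrow as pa

# Each column is a contiguous, typed Arrow array; the binary layout
# is the same no matter which language or process produced it.
ids = pa.array([1, 2, 3], type=pa.int64())
names = pa.array(["alice", "bob", "carol"], type=pa.string())

# A record batch groups columns under a schema; this is the unit that
# gets handed between processes without re-serializing the data.
batch = pa.record_batch([ids, names], names=["id", "name"])
print(batch.schema)
print(batch.num_rows)  # 3
```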
Besides that, Arrow is a set of tools: you have the standard, and then we've built a lot of tools in the community to help you manipulate data in that standard, so you can think of those as sort of Lego bricks for building novel data applications. Some examples of this are getting data into and out of Arrow from various formats, whether it's Avro or Parquet or something like that.
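As a quick illustration of that conversion tooling, here's a hedged sketch using pyarrow's Parquet support; the file name events.parquet is just a placeholder:

```python
# Sketch: moving data between Parquet files and Arrow tables with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
pq.write_table(table, "events.parquet")      # Arrow -> Parquet on disk

roundtrip = pq.read_table("events.parquet")  # Parquet -> Arrow in memory
assert roundtrip.equals(table)
```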
Other examples are compute kernels or engines, which help you do calculations on the Arrow data, even up to things like Flight, which is an RPC mechanism, or other ways of trading data with other applications or processes. It's also important to say what Arrow isn't: it isn't an installable system as such. You can't go and download a copy of Arrow like you would Spark and run it; rather, it's a library. Spark uses Arrow
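To ground the compute kernels and the Flight RPC mechanism just mentioned, here's a short sketch of each using pyarrow; the Flight server location grpc://localhost:8815 and the ticket bytes are hypothetical stand-ins, not anything from the talk:

```python
# Sketch: an Arrow compute kernel call, plus fetching Arrow data over Flight.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.flight as flight

# Compute kernels operate directly on Arrow arrays, no conversion needed.
values = pa.array([1, 2, 3, 4])
print(pc.sum(values))   # 10
print(pc.mean(values))  # 2.5

# Flight moves Arrow record batches between processes over gRPC.
client = flight.connect("grpc://localhost:8815")           # hypothetical server
reader = client.do_get(flight.Ticket(b"example-dataset"))  # hypothetical ticket
table = reader.read_all()  # arrives as Arrow, ready for the kernels above
```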