2022#03 Readings
4 minutes read | 682 words by Ruben Berenguel
Haven’t read much these days, but luckily I have not added much to the list either.
Currently my reading list stands at 99 books (2 read this year) and 10 articles (yay!). The book one is obviously the hardest to trim, because my goal of 50 read this year is tricky: it’s almost a book per week. Although I have already read 2, this week I haven’t finished any (maybe tomorrow, I have two with 50% read).
I have also started using Readwise (this link gives you 60 days to try it instead of 30, and may give me 30 more free days if I don’t subscribe before then, although it’s worth it) to keep track of, and be reminded of, the annotations I take from books. I copy many of my book notes into Obsidian, but not all of them: only the actionable ones or the very enlightening ones. Readwise helps keep all the others, and can sync with the plain and simple My Clippings.txt from a Kindle.
Python Zip Imports: Distribute Modules and Packages Quickly
Although I prefer Scala for anything related to Spark, zip imports in Python can be a convenient way (when they work) to easily deploy additional user code to Spark executors.
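As a rough illustration of the mechanism (not taken from the linked article), here is a minimal sketch of a zip import using only the standard library. The module name user_helpers, its double function and the archive name deps.zip are made up for the example.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Hypothetical module name and source, used only for illustration.
MODULE_NAME = "user_helpers"
MODULE_SOURCE = "def double(x):\n    return 2 * x\n"


def build_zip(directory: Path) -> Path:
    """Write a zip archive containing a single top-level module."""
    archive = directory / "deps.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(f"{MODULE_NAME}.py", MODULE_SOURCE)
    return archive


def import_from_zip(archive: Path):
    """Put the archive on sys.path so the regular import system can read it."""
    sys.path.insert(0, str(archive))
    return importlib.import_module(MODULE_NAME)


tmp_dir = Path(tempfile.mkdtemp())
archive_path = build_zip(tmp_dir)
helpers = import_from_zip(archive_path)

result = helpers.double(21)
print(result)            # 42
print(helpers.__file__)  # a path that goes through deps.zip
```

With Spark, an archive like this is usually shipped with spark-submit --py-files deps.zip or SparkContext.addPyFile, which make it importable on the executors. One case behind the "when they work" caveat: compiled extension modules cannot be imported from inside a zip, so this suits pure-Python code best.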
Efficiently loading massive D3 datasets using Apache Arrow
Impressive amount of work for an equally impressive demo. I disagree with one comment midway through the article: the pyarrow Python library is pretty decent. The work done in the Python script is probably better done in Pandas, which you can then use to interact with Arrow.
Why I switched my newsletters from Substack and Mailchimp to Buttondown
I have been considering moving from Mailchimp to Buttondown for more than a year. I try to streamline my processes as much as possible, and the preparation of the newsletter in MC takes 50%+ of the time the post+newsletter take. Unreasonable.
Ways I Use Testing as a Data Scientist
I was wary when I started reading this article, but it actually mentions hypothesis, pandera, Great Expectations and pytest, so it’s not a waste of time at all.
How to think about the ROI of data work
As mentioned in the article, there are several ways to think of data ROI, but I think Mikkel makes a solid case for his, since it takes into account all the levels data covers (from deep-down data engineering to up-close, data-driven product changes).
Your future self will thank you: Building your personal documentation
One of the recommendations in this article is “Optimize notes for future searchability”. I have been taking notes for many years, and the way I optimize my notes for searching is by writing everything in English (except on very rare occasions). This makes sure there is only one way to find information.
A man on a mission to preserve Barcelona’s decorative floor tiles
There is also a wonderful book (not by the man mentioned in the article) about these traditional Catalan tiles. It is only available in Catalan and is extremely expensive second hand, but highly recommended.
Scaling Kafka at Honeycomb
Instead of using a managed service, we’ve chosen to build expertise in-house, treating outages as unscheduled learning opportunities rather than reasons to fear Kafka.
This is a fascinating overview of the journey the Honeycomb team has travelled with Kafka. It includes several interesting technical hurdles and cost-reduction experiments.