2022#25 Readings 🇺🇦🌻
5 minutes read | 970 words by Ruben Berenguel
Had some stuff going on that ate all my available time.
Hopefully I’ll have more details soon. During this time I have also created a couple more Github projects:
- urls-to-epub-with-pandoc: convert the posts in the examples category in the High Scalability blog into an epub;
- paprika-epub: convert a Paprika export into a recipe book in epub format.
Data, databases, data life
Databricks CTO: Making Our Bet on the Lake House
An interview with Matei Zaharia, where a lot of material is covered. The history of Spark and big data, the relationship with cloud providers, competing with Snowflake and views of the future.
Metrics of a Data Platform
I’ve been in the position of being forced (unsuccessfully though) to accept some metrics for a data platform that were nonsensical. The list in this post is quite the opposite, although getting your systems to cover all of them is significant (but important) work.
The Difficult Life of the Data Lead
From Mikkel Dengsoe’s newsletter. All of this hits close, close and hard. Many will be shared by engineering managers, but data is a world on its own.
Upgrade to Athena Engine Version 3 to Increase Query Performance and Access More Analytics Features
AWS is speeding up development of Athena, keeping it on par with PrestoDB/Trino. This new engine improves query planning performance (kind of a corner case, but can be significant for small ad-hoc queries) and handles Apache Iceberg data better.
Zach Wilson on LinkedIn: By Changing the Sort Order of One of My Parquet Tables Today, I Was Able…
Spoiler: changing the order for sorting Parquet files impacts storage size, due to how data is binary packed in Parquet.
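To get a feel for why sort order matters, here is a toy sketch, not Parquet itself. It assumes a simple run-length encoding, which is one of the encodings Parquet can apply to a column, and a made-up low-cardinality column of country codes. The function name and the data are hypothetical.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical low-cardinality column: 10,000 rows, 4 distinct values.
rng = random.Random(0)
column = [rng.choice(["ES", "FR", "DE", "IT"]) for _ in range(10_000)]

unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

# Fewer runs means fewer (value, length) pairs to store.
print(len(unsorted_runs), "runs before sorting")
print(len(sorted_runs), "runs after sorting")
```

Once sorted, the column collapses to one run per distinct value. Real Parquet files combine several encodings plus compression, so actual savings depend on the data and the writer settings.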
Partitioning in Postgres, 2022 edition
Partitioning (automatically) in Postgres has been possible for some time, but there are constant improvements with each release. Here you will find what is now available and what you can do with it. I’m itching to try some of these.
Hosting SQLite on Github pages
Sometimes you don’t need the fancy, fast and featureful Postgres: sometimes you can get by with just SQLite. And with some WASM magic and some tweaks, you can make it run in a static website.
Storytellers and System Builders
This is an excellent way of dividing data people. I’m partial to system building, although I enjoy storytelling from time to time. But I could not spend all my time in storytelling mode: creating, designing and implementing systems is how I get my “kick”.
Data science or math-y stuff
Anomaly Detection in Time Series: A Comprehensive Evaluation
A summary of a paper, evaluating many anomaly detection algorithms. A long, long time ago I was interested in AD, and remember facing this as a problem: too many options, and no clarity on which to choose. The conclusion of the paper though is not brilliant: for multivariate anomalies you need a multivariate algorithm, and in univariate data you should choose a univariate algorithm. Duh.
Deep IEEE-754 floating point sh*t going on in Python-land
Someone has been messing with my subnormals!
Beware of fast-math
These two are related. A compiler flag triggers an optimisation that cascades through the whole (or most) of the Python scientific stack and affects numbers when they are very close to zero. You can find the details in the two posts above; it’s fairly obscure IEEE-754 territory.
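If you want to check whether your own process is affected, a minimal sketch follows. It assumes the symptom is subnormal numbers being flushed to zero, and the helper name is made up for illustration. Run it after importing the packages you suspect.

```python
import sys

# Smallest positive *normal* double, about 2.2e-308.
smallest_normal = sys.float_info.min


def subnormals_flushed():
    """Return True if a result that should be subnormal comes out as zero."""
    # Half of the smallest normal number is only representable as a subnormal.
    return smallest_normal / 2 == 0.0


print("subnormals flushed to zero:", subnormals_flushed())
```

With standard IEEE-754 behaviour the division yields a tiny non-zero number, so the check returns False. Seeing True suggests that something loaded into the process has changed the floating point mode.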
Work life
Be a Star or a Janitor
When you can choose at work, choose either:
- Enthusiastically tackle the projects and tasks that are just miserable. Be a janitor.
- Seek out the work you are uniquely capable of, or the opportunities where you have some comparative advantages. Be a star.
Does Communication Matter in Technical Interviewing? We Looked at 100K Interviews to Find Out
An analysis by an online interviewing company. The result is that being better at the coding part of the interview has a much larger ROI than being better at communication.
Engineering
Sound – Bartosz Ciechanowski
An interactive exploration of sound. It’s amazing.
The Three Tech Projects You Meet in Hell
If you’ve worked long enough, you will have been part of all of them. Or been on the receiving end of not getting a deliverable from them.
The Dangers of Assert in Python
The title makes it sound scarier than it should. If you use assert for “production style work”, make sure __debug__ is True, that is, that Python is not running with -O. It’s better if you don’t anyway.
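A small sketch of the difference, with hypothetical function names: an assert is removed when Python runs with -O, while an explicit raise always runs.

```python
def withdraw_checked_with_assert(balance, amount):
    # This check disappears when Python runs with -O (__debug__ is then False).
    assert amount <= balance, "insufficient funds"
    return balance - amount


def withdraw(balance, amount):
    # An explicit raise is executed whatever the interpreter flags are.
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


print("__debug__ is", __debug__)
```

Running the same file with `python -O` prints that `__debug__` is False, and the first function then lets an overdraft through silently.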
How Images From NASA’s James Webb Space Telescope Get Their Iconic Look
You knew that space pictures are hand-tuned and painted, right? Well, it turns out there is a lot more to it: there are “schools” that use different palettes for different chemical spectra.
Pack of two for the Casio F-91W
TOTP Tokens on My Wrist With the Smartest Dumb Watch
Pimping my Casio with Oddly Specific Objects' alternate motherboard and firmware
Yup, this is about changing the internals of the classic Casio wristwatch for something weird. If this is the stuff you like, there are older posts about how to make it extremely water and pressure proof by filling it with oil.