I’ve been in the position of being pushed (unsuccessfully, I should add) to accept nonsensical metrics for a data platform. The list in this post is quite the opposite, although getting your systems to cover all of them is substantial (but worthwhile) work.
AWS is speeding up development of Athena, keeping it on par with PrestoDB/Trino. The new engine version improves query planning performance (kind of a corner case, but it can be significant for small ad-hoc queries) and adds better handling of Apache Iceberg data.
Automatic partitioning in Postgres has been possible for some time, but each release brings improvements. Here you will find what is now available and what you can do with it. I’m itching to try some of these.
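To make the feature concrete, here is a minimal sketch of declarative range partitioning (available since PostgreSQL 10), driven from Python with psycopg2. The connection string, table and column names are made up for illustration; the linked article covers what the newer releases add on top of this.

```python
import psycopg2  # assumes psycopg2 is installed and a Postgres instance is reachable

# Hypothetical connection string; adjust for your own setup.
conn = psycopg2.connect("dbname=demo user=postgres host=localhost")
cur = conn.cursor()

# Declarative range partitioning: the parent table holds no rows itself,
# each partition covers one month of data.
cur.execute("""
    CREATE TABLE measurements (
        ts  timestamptz NOT NULL,
        val double precision
    ) PARTITION BY RANGE (ts);
""")
cur.execute("""
    CREATE TABLE measurements_2024_01 PARTITION OF measurements
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
""")

# Rows inserted into the parent are routed to the matching partition
# automatically, and queries filtering on ts only scan relevant partitions.
cur.execute("INSERT INTO measurements VALUES ('2024-01-15', 1.5);")
conn.commit()
cur.close()
conn.close()
```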
This is an excellent way of dividing data people. I’m partial to system building, although I enjoy storytelling from time to time. But I could not spend all my time in storytelling mode: creating, designing and implementing systems is how I get my “kick”.
A summary of a paper evaluating many anomaly detection algorithms. A long, long time ago I was interested in anomaly detection, and I remember facing exactly this problem: too many options, and no clarity on which to choose. The paper’s conclusion, though, is not brilliant: for multivariate anomalies you need a multivariate algorithm, and for univariate data you should choose a univariate algorithm. Duh.
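Obvious as the conclusion sounds, a tiny sketch shows why it holds. Below, a point that looks unremarkable feature by feature is clearly anomalous once the correlation between features is taken into account; the data and the point are made up for illustration, and only numpy is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly correlated 2-D data: x2 is x1 plus a little noise.
x1 = rng.normal(0, 1, 1000)
x2 = x1 + rng.normal(0, 0.1, 1000)
data = np.column_stack([x1, x2])

# An anomaly that is unremarkable in each coordinate separately,
# but breaks the correlation between the two features.
point = np.array([2.0, -2.0])

# Univariate view: per-feature z-scores, both well under 3.
z = np.abs((point - data.mean(axis=0)) / data.std(axis=0))

# Multivariate view: Mahalanobis distance, which accounts for the
# covariance structure and flags the point as extreme.
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = point - data.mean(axis=0)
mahalanobis = np.sqrt(diff @ cov_inv @ diff)

print("z-scores:", z)               # roughly [2, 2] -> looks normal
print("Mahalanobis:", mahalanobis)  # very large -> clearly anomalous
```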
These two are related. A compiler flag triggers an optimisation that cascades through the whole (or most of the) Python scientific stack and affects numbers that are very close to zero. You can find the details in the two posts above; it’s the kind of obscure IEEE-754 behaviour that is easy to miss.
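If you want to see the kind of numbers involved before digging into the posts, here is a small, self-contained sketch of the effect as I understand it: the product below is a subnormal (denormal) double, exactly the sort of value that a process-wide flush-to-zero setting turns into plain zero.

```python
import numpy as np

# 1e-300 * 1e-10 = 1e-310, which is below the smallest normal double
# (~2.2e-308), so the exact result can only be represented as a
# subnormal ("denormal") number.
x = np.float64(1e-300) * np.float64(1e-10)

# Under default IEEE-754 semantics this prints a tiny non-zero value
# and False; in a process where a fast-math-compiled extension has set
# the CPU's flush-to-zero/denormals-are-zero flags, the same
# multiplication yields exactly 0.0.
print(x, x == 0.0)
```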