I don’t care that much about the HA scheduler (we only schedule batch, so Airflow going down would be a critical issue… that anyone in infrastructure/devops can fix in 15 minutes), but smart sensors and a better way to do XCom seem like a godsend. Also, a better REST API makes it easier to trigger DAGs from DAGs (which can be better than subdags for certain operations, like cleanups or vacuums).
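As a rough idea of what triggering a DAG through the REST API could look like, here is a minimal sketch that only builds the HTTP request. It assumes the Airflow 2 stable API endpoint for creating DAG runs; the host, the DAG id and the `conf` payload are made up, and authentication is left out on purpose.

```python
import json
from urllib.request import Request


def build_trigger_request(base_url, dag_id, conf=None):
    """Build (but do not send) a POST request that would create a DAG run.

    Assumes the Airflow 2 stable REST API path /api/v1/dags/{dag_id}/dagRuns.
    Authentication headers are omitted; a real deployment would need them.
    """
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and DAG id, purely for illustration.
request = build_trigger_request(
    "http://airflow.example.com:8080/",
    "nightly_vacuum",
    conf={"table": "events"},
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` from a task in the parent DAG, once credentials are added.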
I wasn’t aware broadcast joins passed through the driver (it makes sense in hindsight). Recently I had to un-broadcast what used to be an easy “map side join” because the small side grew past 1GB, at which point the payoff of broadcasting no longer seems to compensate for the network transfer cost. AQE should help in this case though, and the ideas here would be interesting to see implemented.
Interesting. Technically I implemented entity resolution for our mapping system, although in our case it is fully deterministic (same cookie -> same user). It’s basically a very large (incremental) connected component computation in Spark. Our graph has 3 billion nodes!
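To make the "connected components" part concrete, here is a tiny single-machine sketch using union-find. It is not our Spark job (that one is distributed and incremental); the cookie and user ids are invented, and the only thing it shares with the real thing is the deterministic rule that ids linked by an edge belong to the same entity.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical identifiers: each edge says "this cookie was seen with this user id".
edges = [
    ("cookie_a", "user_1"),
    ("cookie_b", "user_1"),
    ("cookie_c", "user_2"),
]
components = connected_components(edges)
```

With these edges, `cookie_a`, `cookie_b` and `user_1` end up in one component and `cookie_c` with `user_2` in another.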
I have said many times I suck at music but love making sounds. I recently got this iOS app (it is more or less a physical simulator for real instruments) and it is pretty impressive. Here’s a sample of play (not mine of course). The tutorial is a great showcase of features and guitar modelling.
This is a fascinating rabbit hole on dithering algorithms, inspired by the relatively famous game Return of the Obra Dinn, by Lucas Pope (of Papers, Please fame). I had already checked a few references about dithering to tweak images for my 7-color eInk display, and this was the icing on top.
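For a taste of what these algorithms do, here is a minimal sketch of Floyd-Steinberg error diffusion, one of the classic dithering methods, on a grayscale image reduced to pure black and white. It assumes the image is a plain list of rows with values in [0, 1]; a 7-color display would instead pick the nearest palette color and diffuse the error per channel.

```python
def floyd_steinberg(image):
    """Dither a grayscale image (rows of floats in [0, 1]) to 0/1 pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    work = [list(row) for row in image]  # working copy that accumulates error
    out = [[0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 1 if old >= 0.5 else 0
            out[y][x] = new
            err = old - new
            # Push the quantization error to neighbours not yet visited.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A flat mid-gray 8x8 patch comes out as a mix of black and white pixels.
gray = [[0.5] * 8 for _ in range(8)]
dithered = floyd_steinberg(gray)
```

The input is left untouched, and the output has the same shape but only contains 0s and 1s.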
I’m not sure I get the feature store craze. Most examples I have heard of are glorified databases. After some comments from Uwe Korn, I think the idea of a feature store would make sense if it were a store for functions or lambdas: you would define how the feature is extracted from a dataset, and then serve that feature against the target dataset.
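A toy sketch of what I mean by "a store for functions", with everything in it invented for illustration (the `FeatureRegistry` class, the feature names and the rows are not from any real feature store product): features are registered as functions of a row and only computed when served against a dataset.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


class FeatureRegistry:
    """Hypothetical store that keeps feature definitions, not feature values."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[Row], Any]] = {}

    def register(self, name: str):
        """Decorator that stores the extraction function under a name."""
        def decorator(fn: Callable[[Row], Any]) -> Callable[[Row], Any]:
            self._features[name] = fn
            return fn
        return decorator

    def serve(self, names: List[str], dataset: Iterable[Row]) -> List[Row]:
        """Apply the named features to every row of the target dataset."""
        return [
            {name: self._features[name](row) for name in names}
            for row in dataset
        ]


registry = FeatureRegistry()


@registry.register("basket_size")
def basket_size(row: Row) -> int:
    return len(row["items"])


@registry.register("is_returning")
def is_returning(row: Row) -> bool:
    return row["visits"] > 1


# Made-up target dataset.
rows = [
    {"items": ["a", "b"], "visits": 3},
    {"items": [], "visits": 1},
]
features = registry.serve(["basket_size", "is_returning"], rows)
```

The same definitions can then be served against a training set or a live request without storing any precomputed values.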