A relatively common type of query for time-based SQL tables is a find the gap query. How can you do this in AWS Redshift, which does not have the SQL function generate_series?
There is a basic scenario where this type of query is used:
A job that had to run periodically may have failed. When?
This type of problem is pretty typical in data engineering, when you want to backfill a table by re-running a job. My recommendation is to always, always do this type of computation in a programming language (in my work case, Python), but this can be done in pure SQL.
In a generic PostgreSQL database, you would use the function generate_series from the minimum to the maximum in tni table, and then just join.
Sadly, AWS Redshift does not have generate_series.
Well, technically it does
The function exists, and can be used BUT is a “driver-only” function. This means the computation is done in the driver. There is then no way to use it in a real query. It can be used for presentation purposes, like in SELECT * FROM generate_series(1, 30, 1) but you won’t be able to join this (as a table expression) with a real table, because this cannot be sent to workers.
Instead, you can do the following, given a target_table and a a_very_large_table available in your Redshift database. Note that this “very” is relative, and will depend on the granularity of your frequency. In general, if you are using Redshift you probably have a large enough table somewhere. Otherwise, check the stl_ (statistical) system tables, some of these are large-ish.
I will briefly explain what the code does below.
WITH _target_table AS
SELECTDAY::DATE AS dt
SELECTDAY::DATE AS dt
SELECT ((SELECT*FROM start_date) + rn)::DATE AS dt
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY1) AS rn
WHERE rn <= (SELECT*FROM date_span)
ORDERBY dt ASC
CASEWHEN b.dt ISNULLTHENFALSEELSETRUEENDAS available_in_target_table
FROM (SELECTDAY::DATE AS dt
RIGHTJOIN date_ranger ON b.dt = date_ranger.dt
ORDERBY dt ASC;
There is nothing fancy going on, even if it may look like there is. I use common table expressions to simplify the final result and be able to name things.
Alias the input table as CTE _target_table.
First get the initial date (CTE start_date).
Then get the end date (CTE end_date).
Get the date difference from these two, CTE date_span. If you have different frequency, change the frequency in datediff.
The meat and potatoes is the CTE date_ranger, which uses a_very_large_table’s row counts to generate a valid span from start_date to end_date.
Finally, RIGHT (or LEFT, that’s up to your preference) join the target table with the date_ranger to find the gaps (NULLs)
It’s not super-hard, but it’s kind of good to have it in a good copy-pastable way.
Be careful with CTEs in Redshift
Even though CTEs are very handy to make SQL look readable, I have found some cases where the query planner in Redshift did not optimise a query in a CTE, whereas an embedded query was optimized.
I recommend you use CREATE TEMP TABLE AS (SELECT …) and compare run times and query plans with the same as a CTE: it is relatively easy to go from one to the other, readability can be similar enough, but runtime could be very different.