Find-the-gap with SQL in AWS Redshift3 minutes read | 632 words by Ruben Berenguel
A relatively common type of query for time-based SQL tables is a find the gap query. How can you do this in AWS Redshift, which does not have the SQL function
There is a basic scenario where this type of query is used:
A job that had to run periodically may have failed. When?
This type of problem is pretty typical in data engineering, when you want to backfill a table by re-running a job. My recommendation is to always, always do this type of computation in a programming language (in my work case, Python), but this can be done in pure SQL.
In a generic PostgreSQL database, you would use the function
generate_series from the minimum to the maximum in tni table, and then just join.
Sadly, AWS Redshift does not have
The function exists, and can be used BUT is a “driver-only” function. This means the computation is done in the driver. There is then no way to use it in a real query. It can be used for presentation purposes, like in
SELECT * FROM generate_series(1, 30, 1) but you won’t be able to join this (as a table expression) with a real table, because this cannot be sent to workers.
Instead, you can do the following, given a
target_table and a
a_very_large_table available in your Redshift database. Note that this “very” is relative, and will depend on the granularity of your frequency. In general, if you are using Redshift you probably have a large enough table somewhere. Otherwise, check the
stl_ (statistical) system tables, some of these are large-ish.
I will briefly explain what the code does below.
WITH _target_table AS ( SELECT * FROM target_table ), start_date AS ( SELECT DAY::DATE AS dt FROM _target_table ORDER BY 1 ASC LIMIT 1 ), end_date AS ( SELECT DAY::DATE AS dt FROM _target_table ORDER BY 1 DESC LIMIT 1 ), date_span AS ( SELECT datediff('day', (SELECT * FROM start_date), (SELECT * FROM end_date)) FROM start_date ), date_ranger AS ( SELECT ((SELECT * FROM start_date) + rn)::DATE AS dt FROM (SELECT ROW_NUMBER() OVER (PARTITION BY 1) AS rn FROM a_very_large_table) WHERE rn <= (SELECT * FROM date_span) ORDER BY dt ASC ) SELECT date_ranger.dt, CASE WHEN b.dt IS NULL THEN FALSE ELSE TRUE END AS available_in_target_table FROM (SELECT DAY::DATE AS dt FROM _target_table GROUP BY 1 ORDER BY 1 DESC) b RIGHT JOIN date_ranger ON b.dt = date_ranger.dt ORDER BY dt ASC;
There is nothing fancy going on, even if it may look like there is. I use common table expressions to simplify the final result and be able to name things.
- Alias the input table as CTE
- First get the initial date (CTE
- Then get the end date (CTE
- Get the date difference from these two, CTE
date_span. If you have different frequency, change the frequency in
- The meat and potatoes is the CTE
date_ranger, which uses
a_very_large_table’s row counts to generate a valid span from
LEFT, that’s up to your preference) join the target table with the
date_rangerto find the gaps (
It’s not super-hard, but it’s kind of good to have it in a good copy-pastable way.
Even though CTEs are very handy to make SQL look readable, I have found some cases where the query planner in Redshift did not optimise a query in a CTE, whereas an embedded query was optimized.
I recommend you use
CREATE TEMP TABLE AS (SELECT …) and compare run times and query plans with the same as a CTE: it is relatively easy to go from one to the other, readability can be similar enough, but runtime could be very different.