Find-the-gap with SQL in AWS Redshift
3 minutes read | 632 words by Ruben BerenguelA relatively common type of query for time-based SQL tables is a find the gap query. How can you do this in AWS Redshift, which does not have the SQL function generate_series
?
There is a basic scenario where this type of query is used:
A job that had to run periodically may have failed. When?
This type of problem is pretty typical in data engineering, when you want to backfill a table by re-running a job. My recommendation is to always, always do this type of computation in a programming language (in my work case, Python), but this can be done in pure SQL.
In a generic PostgreSQL database, you would use the function generate_series
from the minimum to the maximum in tni table, and then just join.
Sadly, AWS Redshift does not have generate_series
.
The function exists, and can be used BUT is a “driver-only” function. This means the computation is done in the driver. There is then no way to use it in a real query. It can be used for presentation purposes, like in SELECT * FROM generate_series(1, 30, 1)
but you won’t be able to join this (as a table expression) with a real table, because this cannot be sent to workers.
Instead, you can do the following, given a target_table
and a a_very_large_table
available in your Redshift database. Note that this “very” is relative, and will depend on the granularity of your frequency. In general, if you are using Redshift you probably have a large enough table somewhere. Otherwise, check the stl_
(statistical) system tables, some of these are large-ish.
I will briefly explain what the code does below.
WITH _target_table AS
(
SELECT *
FROM target_table
),
start_date AS
(
SELECT DAY::DATE AS dt
FROM _target_table
ORDER BY 1 ASC LIMIT 1
),
end_date AS
(
SELECT DAY::DATE AS dt
FROM _target_table
ORDER BY 1 DESC LIMIT 1
),
date_span AS
(
SELECT datediff('day',
(SELECT * FROM start_date),
(SELECT * FROM end_date))
FROM start_date
),
date_ranger AS
(
SELECT ((SELECT * FROM start_date) + rn)::DATE AS dt
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY 1) AS rn
FROM a_very_large_table)
WHERE rn <= (SELECT * FROM date_span)
ORDER BY dt ASC
)
SELECT date_ranger.dt,
CASE
WHEN b.dt IS NULL THEN FALSE
ELSE TRUE
END AS available_in_target_table
FROM (SELECT DAY::DATE AS dt
FROM _target_table
GROUP BY 1
ORDER BY 1 DESC) b
RIGHT JOIN date_ranger ON b.dt = date_ranger.dt
ORDER BY dt ASC;
There is nothing fancy going on, even if it may look like there is. I use common table expressions to simplify the final result and be able to name things.
- Alias the input table as CTE
_target_table
. - First get the initial date (CTE
start_date
). - Then get the end date (CTE
end_date
). - Get the date difference from these two, CTE
date_span
. If you have different frequency, change the frequency indatediff
. - The meat and potatoes is the CTE
date_ranger
, which usesa_very_large_table
’s row counts to generate a valid span fromstart_date
toend_date
. - Finally,
RIGHT
(orLEFT
, that’s up to your preference) join the target table with thedate_ranger
to find the gaps (NULL
s)
It’s not super-hard, but it’s kind of good to have it in a good copy-pastable way.
Even though CTEs are very handy to make SQL look readable, I have found some cases where the query planner in Redshift did not optimise a query in a CTE, whereas an embedded query was optimized.
I recommend you use CREATE TEMP TABLE AS (SELECT …)
and compare run times and query plans with the same as a CTE: it is relatively easy to go from one to the other, readability can be similar enough, but runtime could be very different.