Find-the-gap with SQL in AWS Redshift

November 29, 2020 3 minutes read | 632 words by Ruben Berenguel

A relatively common type of query for time-based SQL tables is a find the gap query. How can you do this in AWS Redshift, which does not have the SQL function generate_series?

There is a basic scenario where this type of query is used:

A job that had to run periodically may have failed. When?

This type of problem is pretty typical in data engineering, when you want to backfill a table by re-running a job. My recommendation is to always, always do this type of computation in a programming language (in my work case, Python), but this can be done in pure SQL.

In a generic PostgreSQL database, you would use the function generate_series from the minimum to the maximum in tni table, and then just join.

Sadly, AWS Redshift does not have generate_series.

Well, technically it does

The function exists, and can be used BUT is a “driver-only” function. This means the computation is done in the driver. There is then no way to use it in a real query. It can be used for presentation purposes, like in SELECT * FROM generate_series(1, 30, 1) but you won’t be able to join this (as a table expression) with a real table, because this cannot be sent to workers.

Instead, you can do the following, given a target_table and a a_very_large_table available in your Redshift database. Note that this “very” is relative, and will depend on the granularity of your frequency. In general, if you are using Redshift you probably have a large enough table somewhere. Otherwise, check the stl_ (statistical) system tables, some of these are large-ish.

I will briefly explain what the code does below.

WITH _target_table AS 
(
  SELECT *
  FROM target_table
),
start_date AS
(
  SELECT DAY::DATE AS dt
  FROM _target_table
  ORDER BY 1 ASC LIMIT 1
),
end_date AS
(
  SELECT DAY::DATE AS dt
  FROM _target_table
  ORDER BY 1 DESC LIMIT 1
),
date_span AS
(
  SELECT datediff('day', 
           (SELECT * FROM start_date), 
           (SELECT * FROM end_date))
  FROM start_date
),
date_ranger AS
(
  SELECT ((SELECT * FROM start_date) + rn)::DATE AS dt
  FROM (SELECT ROW_NUMBER() OVER (PARTITION BY 1) AS rn
        FROM a_very_large_table)
  WHERE rn <= (SELECT * FROM date_span)
  ORDER BY dt ASC
)
SELECT date_ranger.dt,
       CASE
         WHEN b.dt IS NULL THEN FALSE
         ELSE TRUE
       END AS available_in_target_table
FROM (SELECT DAY::DATE AS dt
      FROM _target_table
      GROUP BY 1
      ORDER BY 1 DESC) b
  RIGHT JOIN date_ranger ON b.dt = date_ranger.dt
ORDER BY dt ASC;

There is nothing fancy going on, even if it may look like there is. I use common table expressions to simplify the final result and be able to name things.

Alias the input table as CTE _target_table.
First get the initial date (CTE start_date).
Then get the end date (CTE end_date).
Get the date difference from these two, CTE date_span. If you have different frequency, change the frequency in datediff.
The meat and potatoes is the CTE date_ranger, which uses a_very_large_table’s row counts to generate a valid span from start_date to end_date.
Finally, RIGHT (or LEFT, that’s up to your preference) join the target table with the date_ranger to find the gaps (NULLs)

It’s not super-hard, but it’s kind of good to have it in a good copy-pastable way.

Be careful with CTEs in Redshift

Even though CTEs are very handy to make SQL look readable, I have found some cases where the query planner in Redshift did not optimise a query in a CTE, whereas an embedded query was optimized.

I recommend you use CREATE TEMP TABLE AS (SELECT …) and compare run times and query plans with the same as a CTE: it is relatively easy to go from one to the other, readability can be similar enough, but runtime could be very different.