Quite a common problem in data warehouse and analytics is to determine the value of a column relative to a previous record in a partition. For example you want to be able to group user events by user sessions with at least 24 hours interval between each other. In this case you can simply use a regular common table expression (CTE) and a LAG()
window function to find previous timestamp for each user event and then calculate if time difference is less 24 hours. Let’s take this dataset input
for example:
The last column is the desired output, a boolean indicating if we want to keep the record or not. Here is the SQL that we can use to achieve it:
Pretty simple, right? But what if you want to calculate the value, not based on the previous value, but the first, and then repeat the same approach for the remaining records within each partition? For example 24 hours since the first user event. Here is the same dataset, but the last column with expected result is changed to reflect the new requirements:
Notice rn
11 in the first dataset is set to be ignored, but with the new requirements we want to keep it because it's ts
is more than 24 hours from the last record for user_id 1
with keep IS TRUE
(rn
9). We can't use LAG()
function in this case anymore.
Here is where a FOR loop or a recursion can help us. In case of Amazon Redshift the RECURSIVE CTE has become available recently. Most examples of it demonstrate it’s usage with hierarchical data, but our use case is quite different.
At first we need to numerate the rows using ROW_NUMBER()
window function for each user in advance before we pass it into recursion because window and aggregate functions are not supported within it. It will produce a chronologically numbered column that starts with 1 for each user’s event record.
Then we can reference it in our RECURSIVE CTE. The first part of the CTE (before the UNION ALL
) identifies the anchor event for each user. The second part tries to match the following event by joining to itself (recursion) using the next row number. If the next event is not more than 24 hours from the previous, it passes the previous end date to the next iteration:
A recommendation from Amazon Redshift is to use an exit criteria to prevent very deep recursions and potential errors. In our case it can be set as a WHERE
clause on row number to be less than 100.
Thanks to Tubi for providing challenging tasks as part of my work so I can come up with more interesting solutions. You can find more big data problems and examples on Tubi Engineering.
1 rn | user_id | ts | keep
2----+---------+---------------------+------
3 1 | 2 | 2022-11-20 19:00:00 | t
4 2 | 1 | 2022-11-20 20:00:00 | t
5 3 | 1 | 2022-11-20 20:10:00 | f
6 4 | 1 | 2022-11-20 20:20:00 | f
7 5 | 1 | 2022-11-20 20:30:00 | f
8 6 | 1 | 2022-11-23 19:44:18 | t
9 7 | 1 | 2022-11-24 20:00:00 | t
10 8 | 1 | 2022-11-24 20:44:18 | f
11 9 | 1 | 2022-11-26 20:00:00 | t
12 10 | 1 | 2022-11-26 21:00:00 | f
13 11 | 1 | 2022-11-27 20:00:01 | f
14 12 | 1 | 2022-11-28 20:44:18 | t
15 13 | 1 | 2022-11-29 20:48:18 | t
16 14 | 1 | 2022-12-01 20:48:18 | t
17 15 | 1 | 2022-12-04 21:00:00 | t
18 16 | 1 | 2022-12-04 21:12:18 | f
1SELECT
2 *,
3 LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS previous_ts,
4 previous_ts IS NULL OR ts NOT BETWEEN previous_ts AND DATEADD('hours', 24, previous_ts) AS keep
5FROM input;
1 rn | user_id | ts | keep
2----+---------+---------------------+------
3 1 | 2 | 2022-11-20 19:00:00 | t
4 2 | 1 | 2022-11-20 20:00:00 | t
5 3 | 1 | 2022-11-20 20:10:00 | f
6 4 | 1 | 2022-11-20 20:20:00 | f
7 5 | 1 | 2022-11-20 20:30:00 | f
8 6 | 1 | 2022-11-23 19:44:18 | t
9 7 | 1 | 2022-11-24 20:00:00 | t
10 8 | 1 | 2022-11-24 20:44:18 | f
11 9 | 1 | 2022-11-26 20:00:00 | t
12 10 | 1 | 2022-11-26 21:00:00 | f
13 11 | 1 | 2022-11-27 20:00:01 | t
14 12 | 1 | 2022-11-28 20:44:18 | t
15 13 | 1 | 2022-11-29 20:48:18 | t
16 14 | 1 | 2022-12-01 20:48:18 | t
17 15 | 1 | 2022-12-04 21:00:00 | t
18 16 | 1 | 2022-12-04 21:12:18 | f
1WITH RECURSIVE input AS (
2 SELECT 2 AS user_id, '2022-11-20 19:00:00'::TIMESTAMP AS ts
3 UNION ALL
4 SELECT 1 AS user_id, '2022-11-20 20:00:00'::TIMESTAMP AS ts
5 UNION ALL
6 SELECT 1 AS user_id, '2022-11-20 20:10:00'::TIMESTAMP AS ts
7 UNION ALL
8 SELECT 1 AS user_id, '2022-11-20 20:20:00'::TIMESTAMP AS ts
9 UNION ALL
10 SELECT 1 AS user_id, '2022-11-20 20:30:00'::TIMESTAMP AS ts
11 UNION ALL
12 SELECT 1 AS user_id, '2022-11-23 19:44:18'::TIMESTAMP AS ts
13 UNION ALL
14 SELECT 1 AS user_id, '2022-11-24 20:00:00'::TIMESTAMP AS ts
15 UNION ALL
16 SELECT 1 AS user_id, '2022-11-24 20:44:18'::TIMESTAMP AS ts
17 UNION ALL
18 SELECT 1 AS user_id, '2022-11-26 20:00:00'::TIMESTAMP AS ts
19 UNION ALL
20 SELECT 1 AS user_id, '2022-11-26 21:00:00'::TIMESTAMP AS ts
21 UNION ALL
22 SELECT 1 AS user_id, '2022-11-27 20:00:01'::TIMESTAMP AS ts
23 UNION ALL
24 SELECT 1 AS user_id, '2022-11-28 20:44:18'::TIMESTAMP AS ts
25 UNION ALL
26 SELECT 1 AS user_id, '2022-11-29 20:48:18'::TIMESTAMP AS ts
27 UNION ALL
28 SELECT 1 AS user_id, '2022-12-01 20:48:18'::TIMESTAMP AS ts
29 UNION ALL
30 SELECT 1 AS user_id, '2022-12-04 21:00:00'::TIMESTAMP AS ts
31 UNION ALL
32 SELECT 1 AS user_id, '2022-12-04 21:12:18'::TIMESTAMP AS ts
33), data AS (
34 SELECT
35 *,
36 ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS rn
37 FROM input
38), test(user_id, rn, ts, end_of_24h_ts, keep) AS (
39 SELECT
40 user_id,
41 rn,
42 ts,
43 DATEADD('hours', 24, ts) AS end_of_24h_ts,
44 TRUE AS keep
45 FROM data
46 WHERE rn = 1
47 UNION ALL
48 SELECT
49 a.user_id,
50 a.rn,
51 a.ts,
52 CASE WHEN a.ts > b.end_of_24h_ts THEN DATEADD('hours', 24, a.ts) ELSE b.end_of_24h_ts END AS end_of_24h_ts,
53 a.ts > b.end_of_24h_ts AS keep
54 FROM data AS a
55 INNER JOIN test AS b ON a.user_id = b.user_id
56 AND a.rn - 1 = b.rn
57 WHERE a.rn < 100
58)
59SELECT
60 ROW_NUMBER() OVER(ORDER BY ts) AS rn,
61 user_id,
62 ts,
63 keep
64FROM test
65ORDER BY rn;