Category: tech

Calculating dynamically anchored windows of data using Recursive CTE in Amazon Redshift

Quite a common problem in data warehouse and analytics is to determine the value of a column relative to a previous record in a partition. For example you want to be able to group user events by user sessions with at least 24 hours interval between each other. In this case you can simply use a regular common table expression (CTE) and a LAG() window function to find previous timestamp for each user event and then calculate if time difference is less 24 hours. Let’s take this dataset input for example:

1 rn | user_id |         ts          | keep
2----+---------+---------------------+------
3  1 |       2 | 2022-11-20 19:00:00 | t
4  2 |       1 | 2022-11-20 20:00:00 | t
5  3 |       1 | 2022-11-20 20:10:00 | f
6  4 |       1 | 2022-11-20 20:20:00 | f
7  5 |       1 | 2022-11-20 20:30:00 | f
8  6 |       1 | 2022-11-23 19:44:18 | t
9  7 |       1 | 2022-11-24 20:00:00 | t
10  8 |       1 | 2022-11-24 20:44:18 | f
11  9 |       1 | 2022-11-26 20:00:00 | t
12 10 |       1 | 2022-11-26 21:00:00 | f
13 11 |       1 | 2022-11-27 20:00:01 | f
14 12 |       1 | 2022-11-28 20:44:18 | t
15 13 |       1 | 2022-11-29 20:48:18 | t
16 14 |       1 | 2022-12-01 20:48:18 | t
17 15 |       1 | 2022-12-04 21:00:00 | t
18 16 |       1 | 2022-12-04 21:12:18 | f

The last column is the desired output, a boolean indicating if we want to keep the record or not. Here is the SQL that we can use to achieve it:

1SELECT
2  *,
3  LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS previous_ts,
4  previous_ts IS NULL OR ts NOT BETWEEN previous_ts AND DATEADD('hours', 24, previous_ts) AS keep
5FROM input;

Pretty simple, right? But what if you want to calculate the value, not based on the previous value, but the first, and then repeat the same approach for the remaining records within each partition? For example 24 hours since the first user event. Here is the same dataset, but the last column with expected result is changed to reflect the new requirements:

1 rn | user_id |         ts          | keep
2----+---------+---------------------+------
3  1 |       2 | 2022-11-20 19:00:00 | t
4  2 |       1 | 2022-11-20 20:00:00 | t
5  3 |       1 | 2022-11-20 20:10:00 | f
6  4 |       1 | 2022-11-20 20:20:00 | f
7  5 |       1 | 2022-11-20 20:30:00 | f
8  6 |       1 | 2022-11-23 19:44:18 | t
9  7 |       1 | 2022-11-24 20:00:00 | t
10  8 |       1 | 2022-11-24 20:44:18 | f
11  9 |       1 | 2022-11-26 20:00:00 | t
12 10 |       1 | 2022-11-26 21:00:00 | f
13 11 |       1 | 2022-11-27 20:00:01 | t
14 12 |       1 | 2022-11-28 20:44:18 | t
15 13 |       1 | 2022-11-29 20:48:18 | t
16 14 |       1 | 2022-12-01 20:48:18 | t
17 15 |       1 | 2022-12-04 21:00:00 | t
18 16 |       1 | 2022-12-04 21:12:18 | f

Notice rn 11 in the first dataset is set to be ignored, but with the new requirements we want to keep it because it's ts is more than 24 hours from the last record for user_id 1 with keep IS TRUE (rn 9). We can't use LAG() function in this case anymore.

Here is where a FOR loop or a recursion can help us. In case of Amazon Redshift the RECURSIVE CTE has become available recently. Most examples of it demonstrate it’s usage with hierarchical data, but our use case is quite different.

At first we need to numerate the rows using ROW_NUMBER() window function for each user in advance before we pass it into recursion because window and aggregate functions are not supported within it. It will produce a chronologically numbered column that starts with 1 for each user’s event record.

Then we can reference it in our RECURSIVE CTE. The first part of the CTE (before the UNION ALL) identifies the anchor event for each user. The second part tries to match the following event by joining to itself (recursion) using the next row number. If the next event is not more than 24 hours from the previous, it passes the previous end date to the next iteration:

1WITH RECURSIVE input AS (
2  SELECT 2 AS user_id, '2022-11-20 19:00:00'::TIMESTAMP AS ts
3  UNION ALL
4  SELECT 1 AS user_id, '2022-11-20 20:00:00'::TIMESTAMP AS ts
5  UNION ALL
6  SELECT 1 AS user_id, '2022-11-20 20:10:00'::TIMESTAMP AS ts
7  UNION ALL
8  SELECT 1 AS user_id, '2022-11-20 20:20:00'::TIMESTAMP AS ts
9  UNION ALL
10  SELECT 1 AS user_id, '2022-11-20 20:30:00'::TIMESTAMP AS ts
11  UNION ALL
12  SELECT 1 AS user_id, '2022-11-23 19:44:18'::TIMESTAMP AS ts
13  UNION ALL
14  SELECT 1 AS user_id, '2022-11-24 20:00:00'::TIMESTAMP AS ts
15  UNION ALL
16  SELECT 1 AS user_id, '2022-11-24 20:44:18'::TIMESTAMP AS ts
17  UNION ALL
18  SELECT 1 AS user_id, '2022-11-26 20:00:00'::TIMESTAMP AS ts
19  UNION ALL
20  SELECT 1 AS user_id, '2022-11-26 21:00:00'::TIMESTAMP AS ts
21  UNION ALL
22  SELECT 1 AS user_id, '2022-11-27 20:00:01'::TIMESTAMP AS ts
23  UNION ALL
24  SELECT 1 AS user_id, '2022-11-28 20:44:18'::TIMESTAMP AS ts
25  UNION ALL
26  SELECT 1 AS user_id, '2022-11-29 20:48:18'::TIMESTAMP AS ts
27  UNION ALL
28  SELECT 1 AS user_id, '2022-12-01 20:48:18'::TIMESTAMP AS ts
29  UNION ALL
30  SELECT 1 AS user_id, '2022-12-04 21:00:00'::TIMESTAMP AS ts
31  UNION ALL
32  SELECT 1 AS user_id, '2022-12-04 21:12:18'::TIMESTAMP AS ts
33), data AS (
34  SELECT
35    *,
36    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS rn
37  FROM input
38), test(user_id, rn, ts, end_of_24h_ts, keep) AS (
39  SELECT
40    user_id,
41    rn,
42    ts,
43    DATEADD('hours', 24, ts) AS end_of_24h_ts,
44    TRUE AS keep
45  FROM data
46  WHERE rn = 1
47  UNION ALL
48  SELECT
49    a.user_id,
50    a.rn,
51    a.ts,
52    CASE WHEN a.ts > b.end_of_24h_ts THEN DATEADD('hours', 24, a.ts) ELSE b.end_of_24h_ts END AS end_of_24h_ts,
53    a.ts > b.end_of_24h_ts AS keep
54  FROM data AS a
55  INNER JOIN test AS b ON a.user_id = b.user_id
56    AND a.rn - 1 = b.rn
57  WHERE a.rn < 100
58)
59SELECT
60  ROW_NUMBER() OVER(ORDER BY ts) AS rn,
61  user_id,
62  ts,
63  keep
64FROM test
65ORDER BY rn;

A recommendation from Amazon Redshift is to use an exit criteria to prevent very deep recursions and potential errors. In our case it can be set as a WHERE clause on row number to be less than 100.

Thanks to Tubi for providing challenging tasks as part of my work so I can come up with more interesting solutions. You can find more big data problems and examples on Tubi Engineering.

Read more about tech

Testing Client-Server Systems

#QA#integration#tests#system#engineering#Quality Assurance

Thoughts on performing Quality Assurance of Client-Servier systems with focus on Integration tests. Learn more about testing methodology

Moved to Next.js with Contentful CMS for my static site

#react#framework#SSG#NextJS

It's always tempting to try some new technology. And what could be better than to try to apply it on one's personal blog? Read on why I migrated from Gatsby

1WITH RECURSIVE input AS (
2  SELECT 2 AS user_id, '2022-11-20 19:00:00'::TIMESTAMP AS ts
3  UNION ALL
4  SELECT 1 AS user_id, '2022-11-20 20:00:00'::TIMESTAMP AS ts
5  UNION ALL
6  SELECT 1 AS user_id, '2022-11-20 20:10:00'::TIMESTAMP AS ts
7  UNION ALL
8  SELECT 1 AS user_id, '2022-11-20 20:20:00'::TIMESTAMP AS ts
9  UNION ALL
10  SELECT 1 AS user_id, '2022-11-20 20:30:00'::TIMESTAMP AS ts
11  UNION ALL
12  SELECT 1 AS user_id, '2022-11-23 19:44:18'::TIMESTAMP AS ts
13  UNION ALL
14  SELECT 1 AS user_id, '2022-11-24 20:00:00'::TIMESTAMP AS ts
15  UNION ALL
16  SELECT 1 AS user_id, '2022-11-24 20:44:18'::TIMESTAMP AS ts
17  UNION ALL
18  SELECT 1 AS user_id, '2022-11-26 20:00:00'::TIMESTAMP AS ts
19  UNION ALL
20  SELECT 1 AS user_id, '2022-11-26 21:00:00'::TIMESTAMP AS ts
21  UNION ALL
22  SELECT 1 AS user_id, '2022-11-27 20:00:01'::TIMESTAMP AS ts
23  UNION ALL
24  SELECT 1 AS user_id, '2022-11-28 20:44:18'::TIMESTAMP AS ts
25  UNION ALL
26  SELECT 1 AS user_id, '2022-11-29 20:48:18'::TIMESTAMP AS ts
27  UNION ALL
28  SELECT 1 AS user_id, '2022-12-01 20:48:18'::TIMESTAMP AS ts
29  UNION ALL
30  SELECT 1 AS user_id, '2022-12-04 21:00:00'::TIMESTAMP AS ts
31  UNION ALL
32  SELECT 1 AS user_id, '2022-12-04 21:12:18'::TIMESTAMP AS ts
33), data AS (
34  SELECT
35    *,
36    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS rn
37  FROM input
38), test(user_id, rn, ts, end_of_24h_ts, keep) AS (
39  SELECT
40    user_id,
41    rn,
42    ts,
43    DATEADD('hours', 24, ts) AS end_of_24h_ts,
44    TRUE AS keep
45  FROM data
46  WHERE rn = 1
47  UNION ALL
48  SELECT
49    a.user_id,
50    a.rn,
51    a.ts,
52    CASE WHEN a.ts > b.end_of_24h_ts THEN DATEADD('hours', 24, a.ts) ELSE b.end_of_24h_ts END AS end_of_24h_ts,
53    a.ts > b.end_of_24h_ts AS keep
54  FROM data AS a
55  INNER JOIN test AS b ON a.user_id = b.user_id
56    AND a.rn - 1 = b.rn
57  WHERE a.rn < 100
58)
59SELECT
60  ROW_NUMBER() OVER(ORDER BY ts) AS rn,
61  user_id,
62  ts,
63  keep
64FROM test
65ORDER BY rn;