如何将CRUD状态更改日志表转换为带有日期和实体列的表?

Yos*_*Yos 5 sql apache-spark-sql

有一个更改日志表,用于跟踪实体(父/子)及其状态(新/现有/已删除):

\n
| timestamp              | parent | child | status   |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-06-10 19:10:25-07 | p1     | c1    | new      |\n| 2022-06-12 19:10:25-07 | p1     | c1    | existing |\n| 2022-06-14 19:10:25-07 | p1     | c1    | deleted  |\n| 2022-06-10 19:10:25-07 | p2     | c1    | new      |\n| 2022-06-12 19:10:25-07 | p2     | c1    | deleted  |\n
Run Code Online (Sandbox Code Playgroud)\n

当一个child实体处于状态时new,意味着它已被创建,当状态为时,existing意味着child保持存在,当状态为时deleted,意味着child不再相关。上表中的每一行都是由手动作业生成的,因此不能保证该作业会定期运行。这意味着具有existing状态的行是可选的new,而对于/deleted的每个组合肯定会出现 和。parentchild

\n

我想将表转换为以下结构:

\n
| date       | parent | child |\n| ---------- | ------ | ----- |\n| 2022-06-10 | p1     | c1    |\n| 2022-06-11 | p1     | c1    |\n| 2022-06-12 | p1     | c1    |\n| 2022-06-13 | p1     | c1    |\n| 2022-06-10 | p2     | c1    |\n| 2022-06-11 | p2     | c1    |\n| 2022-06-10 | p2     | c1    |\n
Run Code Online (Sandbox Code Playgroud)\n

其中每个日期都包含该日期的活动parent/实体行。child

\n

需要指出的是\xe2\x80\x99

\n
    \n
  1. 可以多次创建/删除相同的父/子组合。
  2. \n
  3. 结果表应该从当前日期的角度运行。也就是说,如果new某些父/子组合有一个事件,2022-06-12 19:10:25-07并且没有删除的事件,并且今天是 2023 年 8 月 8 日,那么在 2023 年 8 月 8 日(含)之前的每一天,该组合都应该有一行.例如日志表:
  4. \n
\n
| timestamp              | parent | child | status   |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-06-12 19:10:25-07 | p1     | c1    | new      |\n
Run Code Online (Sandbox Code Playgroud)\n

应该导致

\n
| date       | parent | child |\n| ---------- | ------ | ----- |\n| 2022-06-12 | p1     | c1    |\n| 2022-06-13 | p1     | c1    |\n| 2022-06-14 | p1     | c1    |\n.\n.\n.\n| 2023-08-08 | p1     | c1    | (<= today\'s date)\n
Run Code Online (Sandbox Code Playgroud)\n
    \n
  1. 如果对于某些父/子组合只有一个已删除的事件,则该父/子组合应反映在已删除事件日期之前的几天。例如,我们可以定义时间范围为过去 30 天,因此父/子组合应出现在删除事件日期之前(包括删除事件日期)和之后的current_date()-30日期。例如,如果今天是8月8日,日志表如下:
  2. \n
\n
| timestamp              | parent | child | status   |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-07-11 19:10:25-07 | p1     | c1    | deleted  |\n
Run Code Online (Sandbox Code Playgroud)\n

那么预期的结果表应该是:

\n
| date       | parent | child |\n| ---------- | ------ | ----- |\n| 2022-07-08 | p1     | c1    |\n| 2022-07-09 | p1     | c1    |\n| 2022-07-10 | p1     | c1    |\n| 2022-07-11 | p1     | c1    |\n
Run Code Online (Sandbox Code Playgroud)\n

通用 SQL 解决方案会很棒,但如果不可能,那么 Spark SQL 也可以。

\n

Sri*_*vas 2

请在下面找到 Spark SQL

spark.sql("SELECT * FROM input").show(false)
+-----+------+--------+--------------------+
|child|parent|status  |timestamp           |
+-----+------+--------+--------------------+
|c1   |p1    |new     |2023-08-01T12:00:00Z|
|c1   |p1    |existing|2023-08-01T13:00:00Z|
|c1   |p1    |existing|2023-08-05T13:00:00Z|

|c2   |p2    |new     |2023-08-01T12:00:00Z|
|c2   |p2    |existing|2023-08-01T13:00:00Z|
|c2   |p2    |deleted |2023-08-05T13:00:00Z|

|c3   |p3    |new     |2023-08-01T08:00:00Z|
|c3   |p3    |existing|2023-08-01T13:00:00Z|
|c3   |p3    |deleted |2023-08-01T14:00:00Z|
|c3   |p3    |new     |2023-08-01T15:00:00Z|

|c4   |p4    |new     |2023-08-01T08:00:00Z|

|c1   |p5    |new     |2023-08-08T08:00:00Z|

|c6   |p6    |new     |2023-08-01T12:00:00Z|
|c6   |p6    |existing|2023-08-01T13:00:00Z|
|c6   |p6    |existing|2023-08-05T13:00:00Z|
+-----+------+--------+--------------------+
Run Code Online (Sandbox Code Playgroud)
spark.sql("""WITH input AS (
        SELECT
            CAST(timestamp AS DATE) AS dt, 
            parent,
            child,
            status,
            cast(MAX(timestamp) OVER(PARTITION BY parent, child) AS DATE) as max_ts,
            CASE WHEN LEAD(CAST(timestamp AS DATE), 1) OVER(PARTITION BY parent, child order by timestamp desc) IS NULL 
                 THEN CAST(timestamp AS DATE) 
                 ELSE LEAD(CAST(timestamp AS DATE), 1) OVER(PARTITION BY parent, child order by timestamp desc) 
            END as min_ts,
            row_number() OVER(PARTITION BY parent, child order by timestamp desc) as row_number
        FROM curd_data
    ),
    transformed_input AS (
        SELECT 
            dt,
            parent,
            child,
            max_ts,
            min_ts,
            row_number,
            CASE WHEN dt == max_ts AND (status == 'existing' OR status == 'new') THEN datediff(current_date, min_ts)
                 WHEN dt == max_ts AND status == 'deleted'  THEN datediff(date_sub(max_ts, 1), min_ts)
            END AS new_date_diff ,
            transform(
                sequence(
                        0,
                        CASE WHEN dt == max_ts AND (status == 'existing' OR status == 'new')
                            THEN datediff(current_date, min_ts) 
                        WHEN dt == max_ts AND status == 'deleted'
                            THEN datediff(date_sub(max_ts, 1), min_ts)                        
                        END
                ),
                sid -> date_add(min_ts, sid)                              
            ) as dates 
        FROM input WHERE row_number = 1
    ) 
    SELECT 
        distinct
        explode_outer(dates) AS date,
        parent,
        child
    FROM 
    transformed_input
    """).show(100, false)
// Exiting paste mode, now interpreting.

+----------+------+-----+
|date      |parent|child|
+----------+------+-----+
|2023-08-01|p1    |c1   |
|2023-08-02|p1    |c1   |
|2023-08-03|p1    |c1   |
|2023-08-04|p1    |c1   |
|2023-08-05|p1    |c1   |
|2023-08-06|p1    |c1   |
|2023-08-07|p1    |c1   |
|2023-08-08|p1    |c1   |

|2023-08-01|p2    |c2   |
|2023-08-02|p2    |c2   |
|2023-08-03|p2    |c2   |
|2023-08-04|p2    |c2   | 

|2023-08-01|p3    |c3   | 
|2023-08-02|p3    |c3   |
|2023-08-03|p3    |c3   |
|2023-08-04|p3    |c3   |
|2023-08-05|p3    |c3   |
|2023-08-06|p3    |c3   |
|2023-08-07|p3    |c3   |
|2023-08-08|p3    |c3   |

|2023-08-01|p4    |c4   |
|2023-08-02|p4    |c4   |
|2023-08-03|p4    |c4   |
|2023-08-04|p4    |c4   |
|2023-08-05|p4    |c4   |
|2023-08-06|p4    |c4   |
|2023-08-07|p4    |c4   |
|2023-08-08|p4    |c4   |

|2023-08-08|p5    |c1   |

|2023-08-01|p6    |c6   |
|2023-08-02|p6    |c6   |
|2023-08-03|p6    |c6   |
|2023-08-04|p6    |c6   |
|2023-08-05|p6    |c6   |
|2023-08-06|p6    |c6   |
|2023-08-07|p6    |c6   |
|2023-08-08|p6    |c6   |

+----------+------+-----+

Run Code Online (Sandbox Code Playgroud)