Yos*_*Yos 5 sql apache-spark-sql
有一个更改日志表,用于跟踪实体(父/子)及其状态(新/现有/已删除):
\n| timestamp | parent | child | status |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-06-10 19:10:25-07 | p1 | c1 | new |\n| 2022-06-12 19:10:25-07 | p1 | c1 | existing |\n| 2022-06-14 19:10:25-07 | p1 | c1 | deleted |\n| 2022-06-10 19:10:25-07 | p2 | c1 | new |\n| 2022-06-12 19:10:25-07 | p2 | c1 | deleted |\nRun Code Online (Sandbox Code Playgroud)\n当一个child实体处于状态时new,意味着它已被创建,当状态为时,existing意味着child保持存在,当状态为时deleted,意味着child不再相关。上表中的每一行都是由手动作业生成的,因此不能保证该作业会定期运行。这意味着具有existing状态的行是可选的new,而对于/deleted的每个组合肯定会出现 和。parentchild
我想将表转换为以下结构:
\n| date | parent | child |\n| ---------- | ------ | ----- |\n| 2022-06-10 | p1 | c1 |\n| 2022-06-11 | p1 | c1 |\n| 2022-06-12 | p1 | c1 |\n| 2022-06-13 | p1 | c1 |\n| 2022-06-10 | p2 | c1 |\n| 2022-06-11 | p2 | c1 |\n| 2022-06-10 | p2 | c1 |\nRun Code Online (Sandbox Code Playgroud)\n其中每个日期都包含该日期的活动parent/实体行。child
需要指出的是\xe2\x80\x99
\nnew某些父/子组合有一个事件,2022-06-12 19:10:25-07并且没有删除的事件,并且今天是 2023 年 8 月 8 日,那么在 2023 年 8 月 8 日(含)之前的每一天,该组合都应该有一行.例如日志表:| timestamp | parent | child | status |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-06-12 19:10:25-07 | p1 | c1 | new |\nRun Code Online (Sandbox Code Playgroud)\n应该导致
\n| date | parent | child |\n| ---------- | ------ | ----- |\n| 2022-06-12 | p1 | c1 |\n| 2022-06-13 | p1 | c1 |\n| 2022-06-14 | p1 | c1 |\n.\n.\n.\n| 2023-08-08 | p1 | c1 | (<= today\'s date)\nRun Code Online (Sandbox Code Playgroud)\ncurrent_date()-30日期。例如,如果今天是8月8日,日志表如下:| timestamp | parent | child | status |\n| ---------------------- | ------ | ----- | -------- |\n| 2022-07-11 19:10:25-07 | p1 | c1 | deleted |\nRun Code Online (Sandbox Code Playgroud)\n那么预期的结果表应该是:
\n| date | parent | child |\n| ---------- | ------ | ----- |\n| 2022-07-08 | p1 | c1 |\n| 2022-07-09 | p1 | c1 |\n| 2022-07-10 | p1 | c1 |\n| 2022-07-11 | p1 | c1 |\nRun Code Online (Sandbox Code Playgroud)\n通用 SQL 解决方案会很棒,但如果不可能,那么 Spark SQL 也可以。
\n请在下面找到 Spark SQL
spark.sql("SELECT * FROM input").show(false)
+-----+------+--------+--------------------+
|child|parent|status |timestamp |
+-----+------+--------+--------------------+
|c1 |p1 |new |2023-08-01T12:00:00Z|
|c1 |p1 |existing|2023-08-01T13:00:00Z|
|c1 |p1 |existing|2023-08-05T13:00:00Z|
|c2 |p2 |new |2023-08-01T12:00:00Z|
|c2 |p2 |existing|2023-08-01T13:00:00Z|
|c2 |p2 |deleted |2023-08-05T13:00:00Z|
|c3 |p3 |new |2023-08-01T08:00:00Z|
|c3 |p3 |existing|2023-08-01T13:00:00Z|
|c3 |p3 |deleted |2023-08-01T14:00:00Z|
|c3 |p3 |new |2023-08-01T15:00:00Z|
|c4 |p4 |new |2023-08-01T08:00:00Z|
|c1 |p5 |new |2023-08-08T08:00:00Z|
|c6 |p6 |new |2023-08-01T12:00:00Z|
|c6 |p6 |existing|2023-08-01T13:00:00Z|
|c6 |p6 |existing|2023-08-05T13:00:00Z|
+-----+------+--------+--------------------+
Run Code Online (Sandbox Code Playgroud)
spark.sql("""WITH input AS (
SELECT
CAST(timestamp AS DATE) AS dt,
parent,
child,
status,
cast(MAX(timestamp) OVER(PARTITION BY parent, child) AS DATE) as max_ts,
CASE WHEN LEAD(CAST(timestamp AS DATE), 1) OVER(PARTITION BY parent, child order by timestamp desc) IS NULL
THEN CAST(timestamp AS DATE)
ELSE LEAD(CAST(timestamp AS DATE), 1) OVER(PARTITION BY parent, child order by timestamp desc)
END as min_ts,
row_number() OVER(PARTITION BY parent, child order by timestamp desc) as row_number
FROM curd_data
),
transformed_input AS (
SELECT
dt,
parent,
child,
max_ts,
min_ts,
row_number,
CASE WHEN dt == max_ts AND (status == 'existing' OR status == 'new') THEN datediff(current_date, min_ts)
WHEN dt == max_ts AND status == 'deleted' THEN datediff(date_sub(max_ts, 1), min_ts)
END AS new_date_diff ,
transform(
sequence(
0,
CASE WHEN dt == max_ts AND (status == 'existing' OR status == 'new')
THEN datediff(current_date, min_ts)
WHEN dt == max_ts AND status == 'deleted'
THEN datediff(date_sub(max_ts, 1), min_ts)
END
),
sid -> date_add(min_ts, sid)
) as dates
FROM input WHERE row_number = 1
)
SELECT
distinct
explode_outer(dates) AS date,
parent,
child
FROM
transformed_input
""").show(100, false)
// Exiting paste mode, now interpreting.
+----------+------+-----+
|date |parent|child|
+----------+------+-----+
|2023-08-01|p1 |c1 |
|2023-08-02|p1 |c1 |
|2023-08-03|p1 |c1 |
|2023-08-04|p1 |c1 |
|2023-08-05|p1 |c1 |
|2023-08-06|p1 |c1 |
|2023-08-07|p1 |c1 |
|2023-08-08|p1 |c1 |
|2023-08-01|p2 |c2 |
|2023-08-02|p2 |c2 |
|2023-08-03|p2 |c2 |
|2023-08-04|p2 |c2 |
|2023-08-01|p3 |c3 |
|2023-08-02|p3 |c3 |
|2023-08-03|p3 |c3 |
|2023-08-04|p3 |c3 |
|2023-08-05|p3 |c3 |
|2023-08-06|p3 |c3 |
|2023-08-07|p3 |c3 |
|2023-08-08|p3 |c3 |
|2023-08-01|p4 |c4 |
|2023-08-02|p4 |c4 |
|2023-08-03|p4 |c4 |
|2023-08-04|p4 |c4 |
|2023-08-05|p4 |c4 |
|2023-08-06|p4 |c4 |
|2023-08-07|p4 |c4 |
|2023-08-08|p4 |c4 |
|2023-08-08|p5 |c1 |
|2023-08-01|p6 |c6 |
|2023-08-02|p6 |c6 |
|2023-08-03|p6 |c6 |
|2023-08-04|p6 |c6 |
|2023-08-05|p6 |c6 |
|2023-08-06|p6 |c6 |
|2023-08-07|p6 |c6 |
|2023-08-08|p6 |c6 |
+----------+------+-----+
Run Code Online (Sandbox Code Playgroud)