How to find the longest sequence of consecutive dates?

Far*_*rah 4 apache-spark apache-spark-sql

I have a table of visits, with each visit time stored as a Unix timestamp, like this:

ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800

I use Spark SQL, and I need the longest sequence of consecutive dates for each ID, for example:

ID, longest_seq (days)
1, 2
2, 5
3, 1

I tried to adapt this answer, "Detect consecutive dates ranges using SQL", to my case, but I did not get the result I expected.

 SELECT ID, MIN (d), MAX(d)
    FROM (
      SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d, 
                ROW_NUMBER() OVER(
         PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST') 
                                                           as date)) rn
      FROM purchase
      where ID is not null
      GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) 
    )
    GROUP BY ID, rn
    ORDER BY ID

If anyone has a clue about how to fix this query, or what is wrong with it, I would appreciate the help. Thanks.

[Edit] A more explicit input/output

ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11

The result would be:

ID, MaxSeq (in days)
1,3
2,3
3,1

All visits are stored as timestamps, but I need consecutive days, so multiple visits on the same day count only once.

小智 6

My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for Spark SQL. Wrap each SQL query like this:

spark.sql("""
SQL_QUERY
""")

So, for the first query:

CREATE TABLE intermediate_1 AS
SELECT 
    id,
    time,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
    time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase

This gives you:

id, time, rn, grp
1,  1,    1,  0
1,  2,    2,  0
1,  3,    3,  0
2,  1,    1,  0
2,  3,    2,  1
2,  4,    3,  1
2,  5,    4,  1
2,  10,   5,  5
2,  11,   6,  5
3,  1,    1,  0
3,  4,    2,  2
3,  9,    3,  6
3,  11,   4,  7
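
To make the `grp` column less mysterious, here is a minimal pure-Python sketch of the same computation on the sample rows from the question's edit. It is only an illustration of the idea, not Spark code; the helper name `add_grp` is made up for this example.

```python
from itertools import groupby

# Sample rows from the question's [Edit]: (id, time), where time is a day number.
rows = [(1, 1), (1, 2), (1, 3),
        (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
        (3, 1), (3, 4), (3, 9), (3, 11)]

def add_grp(rows):
    """Return (id, time, rn, grp) tuples.

    Mimics ROW_NUMBER() OVER (PARTITION BY id ORDER BY time)
    and the derived column grp = time - rn.
    """
    out = []
    # sorted() orders by (id, time); groupby then yields one partition per id.
    for id_, partition in groupby(sorted(rows), key=lambda r: r[0]):
        for rn, (_, time) in enumerate(partition, start=1):
            out.append((id_, time, rn, time - rn))
    return out

result = add_grp(rows)
for row in result:
    print(row)
```

Within a run of consecutive times, `time` and `rn` both increase by 1 from row to row, so their difference does not change; a gap makes `time` jump ahead of `rn`, which starts a new `grp` value.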

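We can see that rows belonging to the same consecutive run share the same grp value: within a run, time and rn both increase by 1, so their difference stays constant. We then use GROUP BY and COUNT to get the length of each run.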

CREATE TABLE intermediate_2 AS
SELECT 
    id,
    grp,
    COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp

This returns:

id, grp, num_consecutive
1,  0,   3
2,  0,   1
2,  1,   3
2,  5,   2
3,  0,   1
3,  2,   1
3,  6,   1
3,  7,   1

Now we just use MAX and GROUP BY to get the longest consecutive run for each id.

CREATE TABLE final AS
SELECT 
    id,
    MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id

This gives you:

id, max_consecutive
1,  3
2,  3
3,  1
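
The queries above treat `time` as a day number, as in the question's edit. For raw Unix timestamps, each value first has to be reduced to a day and repeat visits on the same day dropped. The following is a rough pure-Python sketch of the whole pipeline under two assumptions: timestamps are in seconds, and a "day" means a UTC calendar day (no time-zone shift is applied). The function name `longest_streaks` is hypothetical.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

def longest_streaks(visits):
    """Map each id to its longest run of consecutive UTC days.

    visits: iterable of (id, unix_timestamp_in_seconds) pairs.
    """
    days_by_id = defaultdict(set)
    for id_, ts in visits:
        # Floor to a UTC day number; the set drops repeat visits on the same day.
        days_by_id[id_].add(ts // SECONDS_PER_DAY)

    result = {}
    for id_, days in days_by_id.items():
        run_lengths = defaultdict(int)
        for rn, day in enumerate(sorted(days), start=1):
            # Same trick as the grp column: day - rn is constant within a run.
            run_lengths[day - rn] += 1
        result[id_] = max(run_lengths.values())
    return result

# Rows from the top of the question.
visits = [(1, 1493596800), (1, 1493596900), (1, 1493432800),
          (2, 1493596800), (2, 1493596850), (2, 1493432800)]
print(longest_streaks(visits))
```

For these six rows both IDs get a longest run of 1, because the visits fall on 2017-04-29 and 2017-05-01 (UTC) with nothing on the day in between. In Spark SQL the same normalisation could probably be done by selecting distinct `to_date(from_unixtime(time))` values per id before numbering the rows; that variant has not been run here.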

Hope this helps!