Far*_*rah 4 apache-spark apache-spark-sql
I have a database of visits whose times are stored as timestamps, like this:
ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800
I'm using Spark SQL, and for each ID I need the longest sequence of consecutive dates, for example:
ID, longest_seq (days)
1, 2
2, 5
3, 1
I tried to adapt the answer to "Detect consecutive dates ranges using SQL" to my case, but I did not get the result I expected.
SELECT ID, MIN(d), MAX(d)
FROM (
    SELECT ID,
           cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
           ROW_NUMBER() OVER(
               PARTITION BY ID
               ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)) rn
    FROM purchase
    where ID is not null
    GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID
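If anyone has a clue about how to solve this, or about what is wrong with my query, I would appreciate the help. Thanks.
[Edit] A more explicit input/output: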
ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11
The result would be:
ID, MaxSeq (in days)
1, 3
2, 3
3, 1
All visits are stored as timestamps, but I need consecutive days, so several visits on the same day must count as a single day.
小智 6
My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for Spark SQL. You wrap each SQL query like this:
spark.sql("""
SQL_QUERY
""")
So, for the first query:
CREATE TABLE intermediate_1 AS
SELECT
    id,
    time,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
    -- time minus row number is constant within a run of consecutive times
    time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This gives you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
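
To check the table above by hand, the same row-number difference can be sketched in plain Python. This is only an illustration of the idea, not Spark code: `add_row_numbers` and the `purchase` dictionary are hypothetical names, and `time` is assumed to already be an integer day number with no duplicates per id, as in the sample input.

```python
def add_row_numbers(times_by_id):
    """Return (id, time, rn, grp) rows, mimicking ROW_NUMBER() per id ordered by time."""
    rows = []
    for row_id, times in sorted(times_by_id.items()):
        # enumerate(..., start=1) plays the role of ROW_NUMBER()
        for rn, time in enumerate(sorted(times), start=1):
            rows.append((row_id, time, rn, time - rn))
    return rows


# Sample input from the question's edit (assumed to be distinct day numbers)
purchase = {
    1: [1, 2, 3],
    2: [1, 3, 4, 5, 10, 11],
    3: [1, 4, 9, 11],
}

intermediate_1 = add_row_numbers(purchase)
for row in intermediate_1:
    print(row)
```

Within a run such as 3, 4, 5 for id 2, both `time` and `rn` grow by one per row, so `time - rn` does not change until a gap appears.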
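We can see that rows with consecutive times share the same grp value: within such a run, time and rn both increase by 1, so their difference stays constant. We then use GROUP BY and COUNT to get the length of each run.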
CREATE TABLE intermediate_2 AS
SELECT
    id,
    grp,
    COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This returns:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
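
The GROUP BY and COUNT step amounts to counting how often each (id, grp) pair occurs. A minimal plain-Python sketch, assuming the (id, grp) pairs from the first query's output for the sample data (the variable names are hypothetical):

```python
from collections import Counter

# (id, grp) pairs taken from the first query's output for the sample data
id_grp_pairs = [
    (1, 0), (1, 0), (1, 0),
    (2, 0), (2, 1), (2, 1), (2, 1), (2, 5), (2, 5),
    (3, 0), (3, 2), (3, 6), (3, 7),
]

# Equivalent of COUNT(*) ... GROUP BY id, grp
num_consecutive = Counter(id_grp_pairs)

for (row_id, grp), count in sorted(num_consecutive.items()):
    print(row_id, grp, count)
```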
Now we just use MAX and GROUP BY to get the longest run of consecutive times for each id.
CREATE TABLE final AS
SELECT
    id,
    MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id
This gives you:
id, max_consecutive
1, 3
2, 3
3, 1
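
The queries above treat `time` as a day number. The original data holds epoch timestamps with possibly several visits per day, so each timestamp would first have to be reduced to a calendar day and deduplicated before the row-number difference is applied. The following plain-Python sketch shows the whole pipeline under stated assumptions: `longest_daily_streaks` is a hypothetical helper, timestamps are assumed to be epoch seconds, and days are cut in UTC rather than the 'CEST' zone used in the question's query.

```python
from datetime import datetime, timezone
from itertools import groupby


def longest_daily_streaks(visits):
    """Map each id to its longest run of consecutive calendar days (UTC assumed)."""
    days_by_id = {}
    for visit_id, epoch_seconds in visits:
        day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
        # A set keeps one entry per day, so repeated visits count once
        days_by_id.setdefault(visit_id, set()).add(day.toordinal())

    result = {}
    for visit_id, days in days_by_id.items():
        ordered = sorted(days)
        # day minus position is constant inside a run of consecutive days
        keys = [day - rn for rn, day in enumerate(ordered, start=1)]
        # keys are non-decreasing, so equal keys are adjacent for groupby
        result[visit_id] = max(len(list(run)) for _, run in groupby(keys))
    return result


# First sample from the question
visits = [
    (1, 1493596800), (1, 1493596900), (1, 1493432800),
    (2, 1493596800), (2, 1493596850), (2, 1493432800),
]
streaks = longest_daily_streaks(visits)
print(streaks)
```

In Spark SQL the same effect would come from selecting distinct (id, date) pairs before computing the row number; the deduplication matters because two rows on the same day would otherwise get different row numbers and split a run.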
Hope this helps!