dea*_*unk 6 sql vertica window-functions
There is a table of visits:
uid (INT) | created_at (DATETIME)
I want to find how many consecutive days a user has visited our app. For example:
SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123
returns:
d
------------
2012-04-28
2012-04-29
2012-04-30
2012-05-03
2012-05-04
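That is 5 records forming two runs: 3 days (April 28 to 30) and 2 days (May 3 to 4).
My question is how to find the maximum number of consecutive days a user has visited the app (3 days in the example). I tried to find a suitable function in the SQL docs, but with no success. Am I missing something?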
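UPD: Thank you all for your answers! I am actually working with the Vertica analytics database (http://vertica.com/), which is fairly rare, so only a few people have experience with it. It does support the SQL-99 standard, though.
So most of the solutions work with only slight modifications. In the end I wrote my own version of the query: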
-- returns the starts of the visit series
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
s
---------------------
2012-04-28 01:00:00
2012-05-03 01:00:00
-- returns the ends of the visit series
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
f
---------------------
2012-04-30 01:00:00
2012-05-04 01:00:00
So now we only need to join them somehow, for example by row index.
-- pair the n-th series start with the n-th series end;
-- ORDER BY makes the row numbering deterministic
SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
SELECT t1.d as s, ROW_NUMBER() OVER (ORDER BY t1.d) as o1 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
SELECT t1.d as f, ROW_NUMBER() OVER (ORDER BY t1.d) as o2 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2
Sample output:
s | f | seq
---------------------+---------------------+-----
2012-04-28 01:00:00 | 2012-04-30 01:00:00 | 3
2012-05-03 01:00:00 | 2012-05-04 01:00:00 | 2
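For readers without a Vertica instance at hand, the pairing idea can be sketched in Python. This is only an illustration of the logic, not the query itself: the names `days`, `starts`, `ends` and `series` are hypothetical, and the input is assumed to be already reduced to distinct calendar days.

```python
from datetime import date, timedelta

# Distinct visit days from the question's example.
days = {
    date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
    date(2012, 5, 3), date(2012, 5, 4),
}
one_day = timedelta(days=1)

# A day starts a series when the previous day is absent,
# and ends a series when the next day is absent.
starts = sorted(d for d in days if d - one_day not in days)
ends = sorted(d for d in days if d + one_day not in days)

# Pair the n-th start with the n-th end, as the join on o1 = o2 does.
series = [(s, f, (f - s).days + 1) for s, f in zip(starts, ends)]

for s, f, seq in series:
    print(s, f, seq)
# 2012-04-28 2012-04-30 3
# 2012-05-03 2012-05-04 2
```

The pairing is only valid because both lists are sorted the same way, which is why the row numbers in the SQL need a defined order.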
Another approach, the shortest one, uses a self-join:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)
Output:
d | group_number | consecutive_days
---------------------+--------------+------------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
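Live test: http://www.sqlfiddle.com/#!1/93789/1
sr = second row, fr = first row (or perhaps previous row? ツ). Basically we are looking backwards: the join emulates LAG for databases that do not support it. (Postgres does support LAG, but that solution gets long, because window functions cannot be nested.) So this query uses a hybrid approach: emulate LAG with a join, then apply a windowed SUM over the result, which produces the group number.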
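The group-numbering step can be traced by hand with a small Python sketch. It is a rough model of the query under the assumption of one row per day; `group_numbers` is a hypothetical helper name, not something from the answer.

```python
from datetime import date, timedelta
from itertools import accumulate

# Sample visit days from the question (one entry per distinct day).
days = [
    date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
    date(2012, 5, 3), date(2012, 5, 4),
]

def group_numbers(sorted_days):
    """Number each run of consecutive days, mirroring the SQL above."""
    present = set(sorted_days)
    # 1 when the previous day is missing (the left join found no fr row), else 0.
    flags = [int(d - timedelta(days=1) not in present) for d in sorted_days]
    # Running total of the flags, like SUM(...) OVER (ORDER BY sr.d).
    return list(accumulate(flags))

numbers = group_numbers(sorted(days))
print(numbers)  # [1, 1, 1, 2, 2]
```

Every row that opens a new run bumps the running total by one, so all rows of the same run share one group number.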
UPDATE
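I forgot to include the final query. The query above only illustrates how the group numbering works; it needs to be reshaped into this: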
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1
Output:
STARTING_DATE                | END_DATE                     | CONSECUTIVE_DAYS
-----------------------------+------------------------------+------------------
April, 28 2012 08:00:00-0700 | April, 30 2012 08:00:00-0700 | 3
May, 03 2012 08:00:00-0700   | May, 04 2012 08:00:00-0700   | 2
UPDATE
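I now know why my other solution based on window functions got so long: it grew while I was trying to illustrate the logic of group numbering and of counting over each group. If I had cut to the chase, as in my MySql approach, the window function version could have been shorter. Having said that, here is my old window function approach, in better shape now: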
with headers as
(
select
d,lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over (order by d) as group_number
from headers
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1
Live test: http://www.sqlfiddle.com/#!1/93789/21
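As a cross-check of the LAG-based logic, here is a minimal Python sketch. The names `visits` and `streaks` are hypothetical, and the sample rows (including the extra visit on April 29) are made up for illustration. One deliberate difference from the SQL: the sketch compares calendar dates, whereas `d - lag(d) <> interval '1 day'` only treats two rows as consecutive when they are exactly 24 hours apart.

```python
from datetime import datetime

# Hypothetical visit timestamps; two visits fall on April 29.
visits = [
    datetime(2012, 4, 28, 8, 0), datetime(2012, 4, 29, 8, 0),
    datetime(2012, 4, 29, 17, 45), datetime(2012, 4, 30, 8, 0),
    datetime(2012, 5, 3, 8, 0), datetime(2012, 5, 4, 8, 0),
]

def streaks(timestamps):
    """Return (starting_date, ending_date, consecutive_days) per run."""
    days = sorted({t.date() for t in timestamps})  # distinct calendar days
    result = []
    prev = None  # plays the role of lag(d)
    for d in days:
        # "header" marks the first day of a run, as in the headers CTE.
        header = prev is None or (d - prev).days != 1
        if header:
            result.append([d, d, 1])
        else:
            result[-1][1] = d
            result[-1][2] += 1
        prev = d
    return [tuple(r) for r in result]

# Equivalent of: order by consecutive_days desc limit 1
longest = max(streaks(visits), key=lambda r: r[2], default=None)
print(longest[2])  # 3
```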