kha*_*rul 53 sql postgresql aggregate-functions
I am using count and group by to get the number of subscribers registered each day:
SELECT created_at, COUNT(email)
FROM subscriptions
GROUP BY created_at;
Result:
created_at count
-----------------
04-04-2011 100
05-04-2011 50
06-04-2011 50
07-04-2011 300
I want the cumulative total of subscribers for each day instead. How do I get this?
created_at count
-----------------
04-04-2011 100
05-04-2011 150
06-04-2011 200
07-04-2011 500
int*_*tgr 89
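With larger datasets, window functions are the most efficient way to perform these kinds of queries: the table is scanned only once, instead of once for each date as a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up support window functions.
This is what it looks like: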
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM subscriptions
GROUP BY created_at;
Here OVER creates the window; ORDER BY created_at means that it has to sum up the counts in created_at order.
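The nested sum(count(email)) works because window functions are evaluated after GROUP BY aggregation. If that nesting is hard to read, the same idea can be sketched in two explicit steps. The CTE name daily and the column aliases signups and cumulative_signups are made up for illustration; they are not part of the original query.

```sql
-- Step 1: aggregate to one row per day.
WITH daily AS (
    SELECT created_at, count(email) AS signups
    FROM subscriptions
    GROUP BY created_at
)
-- Step 2: running total over the per-day counts.
-- With ORDER BY and no explicit frame, the default window frame runs
-- from the first row up to the current row, which gives a cumulative sum.
SELECT created_at,
       sum(signups) OVER (ORDER BY created_at) AS cumulative_signups
FROM daily
ORDER BY created_at;
```

The final ORDER BY only fixes the order of the output rows; the ORDER BY inside OVER is what defines the running total.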
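Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately this won't remove duplicates that span different dates.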
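Written out in full, the per-day de-duplicated variant might look like the sketch below. The alias cumulative_count is only illustrative.

```sql
-- count(DISTINCT email) de-duplicates within each created_at group only;
-- an email that appears on two different days is still counted twice.
SELECT created_at,
       sum(count(DISTINCT email)) OVER (ORDER BY created_at) AS cumulative_count
FROM subscriptions
GROUP BY created_at;
```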
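If you want to remove all duplicates, I think the easiest way is to use a subquery and DISTINCT ON. This attributes each email to its earliest date (because I'm sorting by created_at in ascending order, it picks the earliest one):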
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM (
    SELECT DISTINCT ON (email) created_at, email
    FROM subscriptions
    ORDER BY email, created_at
) AS subq
GROUP BY created_at;
If you create an index on (email, created_at), this query shouldn't be too slow either.
(If you want to test, this is how I created the sample dataset)
create table subscriptions as
select date '2000-04-04' + (i/10000)::int as created_at,
'foofoobar@foobar.com' || (i%700000)::text as email
from generate_series(1,1000000) i;
create index on subscriptions (email, created_at);
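To check whether the (email, created_at) index actually helps on this sample data, one option is to look at the query plan. This is only a sketch: the planner may or may not choose the index depending on table statistics and configuration, so treat the output as something to inspect rather than a guaranteed result.

```sql
-- Refresh planner statistics for the freshly created table.
ANALYZE subscriptions;

-- Show the plan and actual timings for the DISTINCT ON step.
EXPLAIN ANALYZE
SELECT DISTINCT ON (email) created_at, email
FROM subscriptions
ORDER BY email, created_at;
```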
Use:
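-- DISTINCT collapses the output to one row per date; without it the
-- query returns one row per subscription.
SELECT DISTINCT a.created_at,
       (SELECT COUNT(b.email)
        FROM subscriptions b
        WHERE b.created_at <= a.created_at) AS count
FROM subscriptions a
ORDER BY a.created_at;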