chr*_*g89 3 sql postgresql date aggregate-functions postgresql-9.3
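I am trying to find the number of active users per day.
A user is active when they make more than 10 requests per week for 4 consecutive weeks.
For example, on Oct 31, 2014, a user is active if they made more than 10 requests in total in each of the four preceding weeks.
I have a table `requests`: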
CREATE TABLE requests (
id text PRIMARY KEY, -- id of the request
amount bigint, -- sum of requests made by accounts_id to recipient_id,
-- aggregated on a daily basis based on "date"
accounts_id text, -- id of the user
recipient_id text, -- id of the recipient
date timestamp -- date that the request was made in YYYY-MM-DD
);

Sample values:

INSERT INTO requests
VALUES
('1', 19, 'a1', 'b1', '2014-10-05 00:00:00'),
('2', 19, 'a2', 'b2', '2014-10-06 00:00:00'),
('3', 85, 'a3', 'b3', '2014-10-07 00:00:00'),
('4', 11, 'a1', 'b4', '2014-10-13 00:00:00'),
('5', 2, 'a2', 'b5', '2014-10-14 00:00:00'),
('6', 50, 'a3', 'b5', '2014-10-15 00:00:00'),
('7', 787323, 'a1', 'b6', '2014-10-17 00:00:00'),
('8', 33, 'a2', 'b8', '2014-10-18 00:00:00'),
('9', 14, 'a3', 'b9', '2014-10-19 00:00:00'),
('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');

Example output:

Date | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891

Here is what I tried in order to find the number of active users for one specific date (e.g. 10-01-2014):

SELECT count(*)
FROM
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id

This query only returns the number for a single day, but I need the number for every day. I think the idea is to join against a series of dates, so I tried something like this:

SELECT week_1."Date_series",
count(*)
FROM
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"

However, I don't think I am getting the right answer, and I am not sure why. Any tips / guidance / pointers are much appreciated! :) :)
PS. I am using Postgres 9.3
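
This is a long answer on how to make your query short. :)
Building on my own table (written before you provided a table definition with different (odd!) data types):
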
CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);

List of "active users" for one given day:

SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;
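
In the innermost subquery `w` (for "week"), build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of `generate_series(0,3)`.
To add / subtract a number of days to / from a `date` (not a `timestamp`!), just add / subtract an `integer`. The expression `day - 7 * w` subtracts 0 to 3 times 7 days from the given date, arriving at the end date of each week (`w_end`).
Subtract another 6 days (not 7!) from each to get the respective start (`w_start`).
Additionally, keep the week number `w` (0-3) for the later aggregation.
In the subquery `sub`, join rows from `requests` to the set of 4 weeks where the date lies between the start and end date. GROUP BY the week number `w` and `accounts_id`.
Only weeks with more than 10 requests in total qualify.
In the outer SELECT, count the number of weeks each user (`accounts_id`) qualified for. It must be 4 to qualify as an "active user".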
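
To sanity-check the week arithmetic outside the database, here is a minimal Python sketch of the same logic. It is only an illustration: the names `week_bounds`, `active_users` and `SAMPLE` are made up for this example, and the rows are the sample values from the question reduced to (accounts_id, date, amount).

```python
from collections import defaultdict
from datetime import date, timedelta

# (accounts_id, date, amount), copied from the question's sample rows
SAMPLE = [
    ("a1", date(2014, 10, 5), 19), ("a2", date(2014, 10, 6), 19),
    ("a3", date(2014, 10, 7), 85), ("a1", date(2014, 10, 13), 11),
    ("a2", date(2014, 10, 14), 2), ("a3", date(2014, 10, 15), 50),
    ("a1", date(2014, 10, 17), 787323), ("a2", date(2014, 10, 18), 33),
    ("a3", date(2014, 10, 19), 14), ("a4", date(2014, 10, 19), 11),
    ("a1", date(2014, 10, 25), 1628), ("a2", date(2014, 10, 25), 101),
]


def week_bounds(effective, weeks=4):
    """Return (w, w_start, w_end) for the weeks before the effective date."""
    day = effective - timedelta(days=1)           # the "day - 1" of the query
    return [
        (w, day - timedelta(days=6 + 7 * w), day - timedelta(days=7 * w))
        for w in range(weeks)
    ]


def active_users(rows, effective, threshold=10):
    """Users with more than `threshold` requests in each of the 4 weeks."""
    totals = defaultdict(int)                     # sum(amount) per (w, user)
    for w, w_start, w_end in week_bounds(effective):
        for user, d, amount in rows:
            if w_start <= d <= w_end:             # BETWEEN is inclusive
                totals[(w, user)] += amount
    qualified = defaultdict(int)                  # qualifying weeks per user
    for (w, user), total in totals.items():
        if total > threshold:
            qualified[user] += 1
    return {user for user, n in qualified.items() if n == 4}


for w, w_start, w_end in week_bounds(date(2014, 10, 31)):
    print(w, w_start, w_end)
print(active_users(SAMPLE, date(2014, 10, 31)))
```

For the effective date 2014-10-31 the four weeks come out as Oct 24-30, Oct 17-23, Oct 10-16 and Oct 3-9, and only `a1` exceeds 10 requests in all of them.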

This is dynamite.
Wrapped in a simple SQL function to simplify general use, but the query works just as well on its own:

CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;

The function takes any day (`_now`), "today" by default, and the number of days (`_days`) in the result, 3 by default. Call:

SELECT * FROM f_active_users('2014-10-31', 5);

Or without parameters to use the defaults:

SELECT * FROM f_active_users();
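
The approach is different from the first query.
SQL Fiddle with both queries and variants for your table definition.
In the CTE `r`, pre-aggregate amounts per `(accounts_id, date)` for only the period of interest, for better performance. The table is scanned only once, and the suggested index (see below) will kick in here.
In the inner subquery `d`, generate the necessary list of days: `27 + _days` rows, where `_days` is the desired number of rows in the output, so effectively 28 days or more.
While at it, compute the day of the week (`dow`), which is used to partition the per-weekday window aggregation further down. `i%7` lines up with weekly intervals, though the query works for any interval.
In the inner subquery `a`, generate a unique list of users (`accounts_id`) that exist in the CTE `r` and pass some first superficial tests (enough rows, spanning enough time, with enough total requests).
Generate a Cartesian product from `d` and `a` with a CROSS JOIN to get one row for every relevant day for every relevant user. LEFT JOIN to `r` to append the amount of requests (if any). There is no WHERE condition: we want every day in the result, even if there are no active users at all.
Compute the total amount for the past week (`w_amount`) in the same step, using a window function with a custom frame.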
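
A `ROWS` frame counts rows, not days, which is why the dense day grid from the CROSS JOIN matters: with one row per day, "current row and 6 following" in descending date order is exactly a 7-day week. Here is a small Python sketch of that frame for a single user. The function name `rolling_week_sums` is hypothetical, and days without a row are treated as 0 here, whereas SQL would carry NULL (the later comparison with 10 comes out the same).

```python
from datetime import date, timedelta


def rolling_week_sums(amount_by_day, newest, n_days):
    """Emulate sum(amount) OVER (ORDER BY date DESC
    ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) on a dense day grid."""
    # Dense grid: one entry per day, newest first (ORDER BY date DESC).
    days = [newest - timedelta(days=i) for i in range(n_days)]
    # LEFT JOIN: days without requests contribute nothing (0 in this sketch).
    amounts = [amount_by_day.get(d, 0) for d in days]
    # Current row plus up to 6 following rows; the frame shrinks at the end.
    return {d: sum(amounts[i:i + 7]) for i, d in enumerate(days)}


# Daily amounts of user a1 from the question's sample data.
a1 = {
    date(2014, 10, 5): 19,
    date(2014, 10, 13): 11,
    date(2014, 10, 17): 787323,
    date(2014, 10, 25): 1628,
}
sums = rolling_week_sums(a1, newest=date(2014, 10, 30), n_days=28)
print(sums[date(2014, 10, 30)])  # week Oct 24-30
print(sums[date(2014, 10, 19)])  # week Oct 13-19
```

The last 6 days of the grid get an incomplete frame, which is the reason they can be cut off afterwards.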
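Now cut off the last 6 days. This is optional and may or may not help performance. Test it: `WHERE date >= _now - (21 + _days)`
Compute the count of weeks in which the minimum amount is met (`w_ct`) in a similar window function, this time additionally partitioned by `dow`, so that the frame only contains the same weekday for the past 4 weeks (each row carrying the sum of its respective past week). The expression `count(w_amount > 10 OR NULL)` only counts rows with more than 10 requests.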
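
The `count(... OR NULL)` idiom relies on SQL's three-valued logic: `true OR NULL` is `true`, `false OR NULL` is `NULL`, and `count()` skips NULLs. A tiny Python model of that behaviour, with `None` standing in for NULL (the helper names `sql_or`, `sql_gt` and `sql_count` are made up for illustration):

```python
def sql_or(a, b):
    """Three-valued OR; None plays the role of NULL."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_gt(x, limit):
    """x > limit, where a NULL operand yields NULL."""
    return None if x is None else x > limit


def sql_count(values):
    """count(expr): number of non-NULL values."""
    return sum(v is not None for v in values)


# Four weekly totals for one user; None is a week without any rows.
w_amounts = [19, 2, None, 33]
flags = [sql_or(sql_gt(x, 10), None) for x in w_amounts]
print(flags)             # [True, None, None, True]
print(sql_count(flags))  # 2
```

Only the two weeks above the threshold survive as non-NULL, so this user would get `w_ct = 2` and not count as active.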
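In the outer SELECT, group by `date` and count the users that passed all 4 weeks (`count(w_ct = 4 OR NULL)`). Add 1 to the date to compensate for the off-by-1, then ORDER and LIMIT to the requested number of days.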
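
To cross-check the function's output on a small data set, a brute-force reference in Python can help. This is not the window-function technique itself: it simply recomputes the four weekly sums for every user and every result day, which is slow but easy to verify by hand. The name `active_users_per_day` and the rows below are invented for this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_users_per_day(rows, now, days=3, threshold=10):
    """rows: iterable of (accounts_id, date, amount).
    Returns [(day, active_user_count), ...], newest day first."""
    first = now - timedelta(days=27 + days)
    last = now - timedelta(days=1)
    daily = defaultdict(int)                 # like CTE r: total per user and day
    for user, d, amount in rows:
        if first <= d <= last:
            daily[(user, d)] += amount
    users = {user for user, _ in daily}

    result = []
    for i in range(1, days + 1):
        d = now - timedelta(days=i)          # last day of the most recent week
        active = 0
        for user in users:
            weeks_ok = 0
            for w in range(4):               # weeks ending on d, d-7, d-14, d-21
                end = d - timedelta(days=7 * w)
                w_amount = sum(
                    daily.get((user, end - timedelta(days=k)), 0)
                    for k in range(7)
                )
                if w_amount > threshold:
                    weeks_ok += 1
            if weeks_ok == 4:
                active += 1
        result.append((d + timedelta(days=1), active))   # the "date + 1"
    return result


# Made-up data: one user with 11 requests on four consecutive Thursdays.
rows = [("u1", date(2014, 10, day), 11) for day in (9, 16, 23, 30)]
for day, n in active_users_per_day(rows, date(2014, 10, 31), days=3):
    print(day, n)
```

With this data `u1` is active on 2014-10-31 only. For 2014-10-30 the oldest week slides to Oct 2-8, which contains no requests, so the count drops to 0.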

The perfect index for both queries would be:

CREATE INDEX foo ON requests (date, accounts_id, amount);
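
Performance should be good, but it gets (much) better with the upcoming Postgres 9.4, thanks to the new moving-aggregate support:
Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual.
Aside: don't call a `timestamp` column "date". It is a `timestamp`, not a `date`. Better yet, never use basic type names like `date` or `timestamp` as identifiers. Ever.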