Ale*_*rov 5 postgresql join window-functions postgresql-9.3
Suppose there are two tables:
users
id [pk] | name
--------+---------
1 | Alice
2 | Bob
3 | Charlie
4 | Dan
emails
id | user_id | email
----+---------+-------
1 | 1 | a.1
2 | 1 | a.2
3 | 2 | a.3
4 | 2 | b.1
5 | 2 | a.4
6 | 2 | a.5
7 | 3 | b.2
8 | 3 | a.6
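With a single query I want to retrieve:
I want the output sorted by number of emails in descending order and filtered to include only emails that start with "a". Users without emails should be included as well - treat their email count as 0.
Here is my query: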
SELECT users.id AS user_id, users.name AS name,
emails.id AS email_id, emails.email AS email,
count(emails.id) OVER (PARTITION BY users.id) as n_emails
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY n_emails DESC;
And the (expected) result, which looks fine:
user_id | name | email_id | email | n_emails
---------+---------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3
2 | Bob | 3 | a.3 | 3
1 | Alice | 2 | a.2 | 2
1 | Alice | 1 | a.1 | 2
3 | Charlie | 8 | a.6 | 1
Obviously this is a small, simple example, and the real dataset may be quite large, so I want to paginate with LIMIT/OFFSET. For example, I want to fetch the first pair of users (not just rows):
-- previous query ...
LIMIT 2 OFFSET 0;
And... it fails. I get only incomplete information about Bob:
user_id | name | email_id | email | n_emails
---------+------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3
So the question is: how can I apply limit/offset to objects, in this case users (logical entities, not rows)?
I found this solution: add dense_rank() over users.id and then filter by rank:
SELECT * FROM (
SELECT users.id AS user_id, users.name AS name,
emails.id AS email_id, emails.email AS email,
count(emails.id) OVER (PARTITION BY users.id) as n_emails,
dense_rank() OVER (ORDER BY users.id) as n_user
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY n_emails DESC
) AS sq
WHERE sq.n_user <= 2; -- here it is
The output looks fine:
user_id | name | email_id | email | n_emails | n_user
---------+-------+----------+-------+----------+--------
2 | Bob | 6 | a.5 | 3 | 2
2 | Bob | 5 | a.4 | 3 | 2
2 | Bob | 3 | a.3 | 3 | 2
1 | Alice | 2 | a.2 | 2 | 1
1 | Alice | 1 | a.1 | 2 | 1
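But if you look at the query plan, you will see that the most expensive steps are the subquery scan and the sort. AFAIK it is impossible to build an index on a subquery or CTE, so it always does a sequential scan/filter on n_user, and the query will take a long time on a large dataset.
Another solution I see is to run two queries:
The query is: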
SELECT users.id AS user_id, users.name,
emails.id AS email_id, emails.email,
sq.n_emails
FROM
(SELECT users.id, count(emails.id) AS n_emails
FROM users
LEFT JOIN emails ON users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
GROUP BY users.id
ORDER BY n_emails DESC
LIMIT 2 OFFSET 0 -- here it is
) AS sq
JOIN users ON users.id = sq.id
LEFT JOIN emails ON emails.user_id = users.id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY sq.n_emails DESC;
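This seems to be much faster. But it does not look like a good solution, because I have to duplicate exactly the same query (except for the SELECT ... FROM part), so in effect one query runs twice. Is there a better solution?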
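Assuming we only want users that actually have emails. Users without emails are ignored. The reason I start with this assumption is that all of your queries already do exactly that: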
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
By adding a WHERE condition on emails.email, you effectively convert your LEFT JOIN into a plain [INNER] JOIN and exclude users without emails. Detailed explanation:
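To make the difference concrete, here is a minimal sketch that loads the sample data from the question and compares the filter in WHERE with the same filter in the join condition. It runs against an in-memory SQLite database through Python's sqlite3 module, which is only a stand-in assumed here so the example is self-contained; the SQL itself should behave the same way in PostgreSQL.

```python
import sqlite3

# In-memory SQLite is an assumption made for a runnable demo; the question uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Charlie'),(4,'Dan');
INSERT INTO emails VALUES
  (1,1,'a.1'),(2,1,'a.2'),(3,2,'a.3'),(4,2,'b.1'),
  (5,2,'a.4'),(6,2,'a.5'),(7,3,'b.2'),(8,3,'a.6');
""")

# Filter in WHERE: the NULL-extended row for Dan fails the LIKE test and is dropped.
where_rows = conn.execute("""
    SELECT u.name, count(e.id) AS n_emails
    FROM users u
    LEFT JOIN emails e ON e.user_id = u.id
    WHERE e.email LIKE 'a' || '%'
    GROUP BY u.id, u.name
    ORDER BY u.id
""").fetchall()

# Filter in the join condition: Dan survives with a count of 0.
on_rows = conn.execute("""
    SELECT u.name, count(e.id) AS n_emails
    FROM users u
    LEFT JOIN emails e ON e.user_id = u.id
                      AND e.email LIKE 'a' || '%'
    GROUP BY u.id, u.name
    ORDER BY u.id
""").fetchall()

print(where_rows)  # Dan is missing
print(on_rows)     # Dan is present with n_emails = 0
```

The same applies to a user who only has emails that do not match the pattern: the WHERE variant drops that user, the join-condition variant keeps the user with a count of 0.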
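Your second query does not work as advertised: the result is not "sorted by number of emails descending". You have to nest the result of count() in another CTE or subquery and run dense_rank() on it. You cannot nest window functions in the same query level.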
SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3
ORDER  BY rnk;
This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions over differently sorted rows have their price, too.
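As a rough illustration of paging by user with the nested window functions, the sketch below runs the same query shape against the question's sample data. It uses an in-memory SQLite database via Python's sqlite3 (an assumption for the sake of a runnable demo; window functions need SQLite 3.25 or later), and the helper name users_page and the rank bounds passed as parameters are hypothetical, not part of the original answer.

```python
import sqlite3

# In-memory SQLite is an assumption made for a runnable demo; the question uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Charlie'),(4,'Dan');
INSERT INTO emails VALUES
  (1,1,'a.1'),(2,1,'a.2'),(3,2,'a.3'),(4,2,'b.1'),
  (5,2,'a.4'),(6,2,'a.5'),(7,3,'b.2'),(8,3,'a.6');
""")

PAGE_SQL = """
SELECT u.name, e2.e_id, e2.e_mail, e2.n_emails, e2.rnk
FROM (
    -- outer level: one rank per user, most emails first, user_id breaks ties
    SELECT e1.*, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
    FROM (
        -- inner level: per-user count attached to every matching email row
        SELECT user_id, id AS e_id, email AS e_mail,
               count(*) OVER (PARTITION BY user_id) AS n_emails
        FROM emails
        WHERE email LIKE 'a' || '%'
    ) e1
) e2
JOIN users u ON u.id = e2.user_id
WHERE e2.rnk > ? AND e2.rnk <= ?
ORDER BY e2.rnk, e2.e_id
"""

def users_page(page_no, per_page=2):
    """Hypothetical helper: all matching email rows for one page of users."""
    lo = (page_no - 1) * per_page
    return conn.execute(PAGE_SQL, (lo, lo + per_page)).fetchall()

page1 = users_page(1)  # all rows of the two top-ranked users
page2 = users_page(2)  # all rows of the next users
for row in page1:
    print(row)
```

Filtering on a rank range instead of rnk < 3 is just one way to move from page to page; every user on a page is returned with all of their matching rows.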
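The subquery runs on emails only - which is possible if the preliminary assumption holds. On the other hand, if the predicate WHERE e.email LIKE 'a' || '%' is not very selective, your third query might be faster, even though it reads from the table twice - but only the needed rows the second time. Also improved: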
SELECT e.user_id, u.name,
       e.id AS e_id, e.email AS e_mail, sq.n_emails
FROM  (
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a' || '%'
   GROUP  BY user_id
   ORDER  BY count(*) DESC, user_id  -- break ties
   LIMIT  2 OFFSET 0
   ) sq
JOIN   emails e USING (user_id)
JOIN   users u ON u.id = e.user_id
WHERE  e.email LIKE 'a' || '%'
ORDER  BY sq.n_emails DESC, e.user_id;  -- same tie-breaker keeps each user's rows together
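The LIMIT-in-subquery approach can be tried on the question's sample data with the sketch below. It again assumes an in-memory SQLite database through Python's sqlite3 as a stand-in for PostgreSQL; the helper users_page and the bound LIMIT/OFFSET parameters are hypothetical additions for the demo, and the join uses ON instead of USING purely to keep the statement portable.

```python
import sqlite3

# In-memory SQLite is an assumption made for a runnable demo; the question uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Charlie'),(4,'Dan');
INSERT INTO emails VALUES
  (1,1,'a.1'),(2,1,'a.2'),(3,2,'a.3'),(4,2,'b.1'),
  (5,2,'a.4'),(6,2,'a.5'),(7,3,'b.2'),(8,3,'a.6');
""")

PAGE_SQL = """
SELECT u.name, e.id AS e_id, e.email AS e_mail, sq.n_emails
FROM (
    -- step 1: pick one page of users; LIMIT/OFFSET count users here, not email rows
    SELECT user_id, count(*) AS n_emails
    FROM emails
    WHERE email LIKE 'a' || '%'
    GROUP BY user_id
    ORDER BY count(*) DESC, user_id
    LIMIT ? OFFSET ?
) sq
-- step 2: fetch the detail rows only for the users picked above
JOIN emails e ON e.user_id = sq.user_id
JOIN users u ON u.id = e.user_id
WHERE e.email LIKE 'a' || '%'
ORDER BY sq.n_emails DESC, e.user_id, e.id
"""

def users_page(page_no, per_page=2):
    """Hypothetical helper: all matching email rows for one page of users."""
    return conn.execute(PAGE_SQL, (per_page, (page_no - 1) * per_page)).fetchall()

page1 = users_page(1)
page2 = users_page(2)
for row in page1:
    print(row)
```

Because the limit is applied after GROUP BY, each page holds a fixed number of users regardless of how many email rows each of them has.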
You can include the users table in the inner query again, like you had it before. But you have to pull the email filter into the join condition!
SELECT *
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT u.id AS user_id, u.name, e.id AS e_id
           , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails
      FROM   users u
      LEFT   JOIN emails e ON e.user_id = u.id
                          AND e.email LIKE 'a' || '%'  -- !!!
      ) e1
   ) e2
WHERE  rnk < 3
ORDER  BY rnk;
Which is going to be a bit more expensive.
Since you retrieve users with the most emails first, users without emails will rarely appear in the result. To optimize performance, you could use UNION ALL with LIMIT:
(  -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3  -- adapt to paging!
ORDER  BY rnk
)
UNION ALL
(
SELECT u.name, u.id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a' || '%'
WHERE  e.user_id IS NULL
)
OFFSET 0  -- adapt to paging!
LIMIT  2  -- adapt to paging!
Detailed explanation:
I would consider a MATERIALIZED VIEW for this result, for two reasons:
Build an MV from the second query without the LIMIT (REFRESH MATERIALIZED VIEW), then return the first page, and so on. When to refresh the MV again is a matter of policy.
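The idea can be sketched as follows, with several loudly stated assumptions. SQLite has no materialized views, so a plain table built with CREATE TABLE ... AS stands in for CREATE MATERIALIZED VIEW, and dropping and rebuilding it stands in for REFRESH MATERIALIZED VIEW. The table name user_email_pages, the added rnk column, and the helpers refresh and page are hypothetical names chosen for this demo (window functions need SQLite 3.25 or later).

```python
import sqlite3

# In-memory SQLite is an assumption made for a runnable demo; the question uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT);
INSERT INTO users VALUES (1,'Alice'),(2,'Bob'),(3,'Charlie'),(4,'Dan');
INSERT INTO emails VALUES
  (1,1,'a.1'),(2,1,'a.2'),(3,2,'a.3'),(4,2,'b.1'),
  (5,2,'a.4'),(6,2,'a.5'),(7,3,'b.2'),(8,3,'a.6');
""")

# The count-then-join query without LIMIT, plus a per-user rank (an addition for paging).
RANKED_SQL = """
SELECT e.user_id, u.name, e.id AS e_id, e.email AS e_mail, sq.n_emails, sq.rnk
FROM (
    SELECT g.*, row_number() OVER (ORDER BY n_emails DESC, user_id) AS rnk
    FROM (
        SELECT user_id, count(*) AS n_emails
        FROM emails
        WHERE email LIKE 'a' || '%'
        GROUP BY user_id
    ) g
) sq
JOIN emails e ON e.user_id = sq.user_id
JOIN users u ON u.id = e.user_id
WHERE e.email LIKE 'a' || '%'
"""

def refresh():
    """Stand-in for REFRESH MATERIALIZED VIEW: rebuild the stored result."""
    conn.execute("DROP TABLE IF EXISTS user_email_pages")
    conn.execute("CREATE TABLE user_email_pages AS " + RANKED_SQL)
    # Unlike a subquery, the stored result can be indexed on the rank.
    conn.execute("CREATE INDEX user_email_pages_rnk ON user_email_pages (rnk)")

def page(page_no, per_page=2):
    """Read one page of users from the stored result."""
    lo = (page_no - 1) * per_page
    return conn.execute(
        "SELECT name, e_mail, n_emails FROM user_email_pages "
        "WHERE rnk > ? AND rnk <= ? ORDER BY rnk, e_id",
        (lo, lo + per_page),
    ).fetchall()

refresh()
first = page(1)

# New emails arrive; the stored result stays stale until the next refresh.
conn.execute("INSERT INTO emails VALUES (9,3,'a.7'),(10,3,'a.8'),(11,3,'a.9')")
stale = page(1)
refresh()
fresh = page(1)
print(first == stale, [r[0] for r in fresh])
```

The stale read between the insert and the second refresh is the trade-off behind the refresh policy: pages are cheap to serve, but only as current as the last rebuild.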