Order by the number of related rows in a referenced table

Ale*_*rov 5 postgresql join window-functions postgresql-9.3

Suppose there are two tables:

users

id [pk] |   name
--------+---------
      1 | Alice
      2 | Bob
      3 | Charlie
      4 | Dan

emails

 id | user_id | email 
----+---------+-------
  1 |       1 | a.1
  2 |       1 | a.2
  3 |       2 | a.3
  4 |       2 | b.1
  5 |       2 | a.4
  6 |       2 | a.5
  7 |       3 | b.2
  8 |       3 | a.6
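To reproduce the examples, the two tables can be created roughly as follows. This is only a sketch: the column types, the primary key on emails.id and the foreign key are assumptions, since the question shows just the data.

```sql
-- Assumed schema: types and constraints are guesses; only the rows come from the question.
CREATE TABLE users (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE emails (
    id      integer PRIMARY KEY,
    user_id integer NOT NULL REFERENCES users (id),
    email   text NOT NULL
);

INSERT INTO users (id, name) VALUES
    (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'Dan');

INSERT INTO emails (id, user_id, email) VALUES
    (1, 1, 'a.1'), (2, 1, 'a.2'),
    (3, 2, 'a.3'), (4, 2, 'b.1'), (5, 2, 'a.4'), (6, 2, 'a.5'),
    (7, 3, 'b.2'), (8, 3, 'a.6');
```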

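With one query I want to retrieve:

  • the user's ID and name
  • the user's email count
  • the user's emails and their IDs

I want the output sorted by email count in descending order and filtered to include only emails that start with "a". Users without emails should be included too, with their email count treated as 0.

Here is my query: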

SELECT users.id AS user_id, users.name AS name,
       emails.id AS email_id, emails.email AS email,
       count(emails.id) OVER (PARTITION BY users.id) as n_emails
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY n_emails DESC;

And the (expected) result, which looks fine:

 user_id |  name   | email_id | email | n_emails 
---------+---------+----------+-------+----------
       2 | Bob     |        6 | a.5   |        3
       2 | Bob     |        5 | a.4   |        3
       2 | Bob     |        3 | a.3   |        3
       1 | Alice   |        2 | a.2   |        2
       1 | Alice   |        1 | a.1   |        2
       3 | Charlie |        8 | a.6   |        1

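Obviously this is a small, simple example. The real data set may be large, so I want to paginate with LIMIT/OFFSET. For example, I want to fetch the first pair of users (not just rows):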

-- previous query ...
LIMIT 2 OFFSET 0;

And... it fails. I get only incomplete information about Bob:

 user_id | name | email_id | email | n_emails 
---------+------+----------+-------+----------
       2 | Bob  |        6 | a.5   |        3
       2 | Bob  |        5 | a.4   |        3

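So the question is: how do I apply limit/offset to objects, in this case users (logical entities, not rows)?

I found this solution: add dense_rank() over users.id and then filter by rank: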

SELECT * FROM (
    SELECT users.id AS user_id, users.name AS name,
           emails.id AS email_id, emails.email AS email,
           count(emails.id) OVER (PARTITION BY users.id) as n_emails,
           dense_rank() OVER (ORDER BY users.id) as n_user
    FROM users
    LEFT JOIN emails on users.id = emails.user_id
    WHERE emails.email LIKE 'a' || '%%'
    ORDER BY n_emails DESC
    ) AS sq
WHERE sq.n_user <= 2; -- here it is

The output looks fine:

 user_id | name  | email_id | email | n_emails | n_user 
---------+-------+----------+-------+----------+--------
       2 | Bob   |        6 | a.5   |        3 |      2
       2 | Bob   |        5 | a.4   |        3 |      2
       2 | Bob   |        3 | a.3   |        3 |      2
       1 | Alice |        2 | a.2   |        2 |      1
       1 | Alice |        1 | a.1   |        2 |      1

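But if you look at the query plan, you will see that the most expensive steps are the subquery scan and the sort. AFAIK it is not possible to build an index on a subquery or CTE, so the filter on n_user always runs as a sequential scan, and the query will take a long time on a large data set.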
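One way to check which steps dominate is to prefix the query with EXPLAIN. The following is a minimal sketch against the users and emails tables from the question. Note that the ANALYZE option actually executes the query, and the plan you get depends on your data volume and indexes.

```sql
-- Show the executed plan, with timings and buffer usage, for the dense_rank() variant.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM (
    SELECT users.id AS user_id, users.name AS name,
           emails.id AS email_id, emails.email AS email,
           count(emails.id) OVER (PARTITION BY users.id) AS n_emails,
           dense_rank() OVER (ORDER BY users.id) AS n_user
    FROM users
    LEFT JOIN emails ON users.id = emails.user_id
    WHERE emails.email LIKE 'a%'
    ) AS sq
WHERE sq.n_user <= 2;
```

The filter on n_user can only be applied after the window function has been computed over the whole joined set, which is why the cost grows with the size of the tables and not with the page size.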

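Another solution I see is to run two queries:

  1. Use a subquery to retrieve only the user IDs and email counts of the filtered and sorted data set;
  2. Join that subquery back to users and emails.

The query is: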

SELECT users.id AS user_id, users.name,
       emails.id AS email_id, emails.email,
       sq.n_emails
FROM
(SELECT users.id, count(emails.id) AS n_emails
    FROM users
    LEFT JOIN emails ON users.id = emails.user_id
    WHERE emails.email LIKE 'a' || '%%'
    GROUP BY users.id
    ORDER BY n_emails DESC
    LIMIT 2 OFFSET 0 -- here it is
    ) AS sq
JOIN users ON users.id = sq.id
LEFT JOIN emails ON emails.user_id = users.id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY sq.n_emails DESC;

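This seems to be much faster. But it does not look like a good solution, because I have to duplicate exactly the same query (apart from the SELECT ... FROM part), so in effect one query runs twice. Is there a better solution?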

Erw*_*ter 2

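Excluding users without emails

Assume we only want users that actually have emails; users without emails are ignored. I start with this assumption because all of your queries already do that: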

LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'

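By adding a WHERE condition on emails.email you effectively convert your LEFT JOIN into a plain [INNER] JOIN and exclude users without emails. The rows that the LEFT JOIN adds for such users have emails.email set to NULL, and NULL never passes the LIKE test.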
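The effect is easy to see with a plain aggregate. The sketch below assumes the users and emails tables from the question and differs between the two statements only in where the filter sits.

```sql
-- Filter in WHERE: the NULL-extended row of a user without matching emails
-- fails the LIKE test, so that user is dropped from the result.
SELECT u.id, u.name, count(e.id) AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
WHERE  e.email LIKE 'a%'
GROUP  BY u.id, u.name
ORDER  BY n_emails DESC, u.id;

-- Filter in the join condition: non-matching emails are discarded before the
-- outer join is completed, so every user survives and count() can be 0.
SELECT u.id, u.name, count(e.id) AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a%'
GROUP  BY u.id, u.name
ORDER  BY n_emails DESC, u.id;
```

With the sample data, the first statement returns Bob, Alice and Charlie, while the second also returns Dan with a count of 0.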

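Second query rewritten

Your second query does not work as advertised: the result is not "sorted by email count in descending order". Its dense_rank() is ordered by users.id, so the rank filter picks the users with the lowest IDs, not those with the most emails. You have to nest the result of count() in another CTE or subquery and run dense_rank() on it. You cannot nest window functions at the same query level.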

SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3
ORDER  BY rnk;

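If the predicate is selective enough (it selects only a small fraction of all emails), this should be the fastest. Two window functions with different sort orders also have their price.

  • The key point is to run the subqueries on emails only, which is possible only if the initial assumption holds.

Third query improved

On the other hand, if the predicate WHERE e.email LIKE 'a' || '%' is not very selective, your third query may be faster, even though it reads from the table twice, because the second time it reads only the required rows. Also improved: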

SELECT e.user_id, u.name,
       e.id AS e_id, e.email AS e_mail, sq.n_emails
FROM  (
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a' || '%'
   GROUP  BY user_id
   ORDER  BY count(*) DESC, user_id  -- break ties
   LIMIT  2  OFFSET 0
   ) sq
JOIN   emails e USING (user_id)
JOIN   users  u ON u.id = e.user_id
WHERE  e.email LIKE 'a' || '%'
ORDER  BY sq.n_emails DESC, e.user_id;  -- same tie-breaker as the subquery

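Including users without emails

You can include the users table in the inner query again, similar to what you had before. But you have to pull the email filter into the join condition!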

SELECT *  -- users is already joined in the inner query, so no outer alias u exists
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT u.id AS user_id, u.name, e.id AS e_id, e.email AS e_mail
           , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails          
      FROM   users u
      LEFT   JOIN emails e ON e.user_id = u.id
                          AND e.email LIKE 'a' || '%'  -- !!!
      ) e1
   ) e2
WHERE  rnk < 3
ORDER  BY rnk;

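This will be a bit more expensive.

Since you retrieve the users with the most emails first, users without emails will rarely appear in the result. To optimize performance, you can use UNION ALL with LIMIT: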

(  -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3      -- adapt to paging!
ORDER  BY rnk
)
UNION ALL
(    
SELECT u.name, u.id AS user_id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM   users       u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a' || '%'
WHERE  e.user_id IS NULL  -- keep only users without a matching email
)
OFFSET 0      -- adapt to paging!
LIMIT  2      -- adapt to paging!


MATERIALIZED VIEW

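I would consider materializing the result, for two reasons:

  • Subsequent queries are much faster.
  • You don't have to operate on a moving target. You mention paging: if a user receives a new email between pages, your whole sort order may be moot.

Build the MV from the query without the paging LIMIT, run REFRESH MATERIALIZED VIEW, then return the first page, and so on. When to refresh the MV again is a matter of policy.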
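A minimal sketch of that approach follows. The view name user_email_page and the index name are hypothetical, and the choice of the "including users without emails" query as the basis is an assumption. Materialized views require PostgreSQL 9.3 or later.

```sql
-- Hypothetical view: stores the ranked, unpaged result once.
CREATE MATERIALIZED VIEW user_email_page AS
SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
FROM  (
   SELECT u.id AS user_id, u.name, e.id AS e_id, e.email
        , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails
   FROM   users u
   LEFT   JOIN emails e ON e.user_id = u.id
                       AND e.email LIKE 'a%'
   ) sub;

-- Unlike a subquery or CTE, a materialized view can be indexed.
CREATE INDEX user_email_page_rnk_idx ON user_email_page (rnk);

-- Page 1: the first two users with all of their matching emails.
SELECT user_id, name, e_id, email, n_emails
FROM   user_email_page
WHERE  rnk BETWEEN 1 AND 2
ORDER  BY rnk, e_id;

-- Page 2 would use: WHERE rnk BETWEEN 3 AND 4

-- Re-run the stored query whenever your refresh policy calls for it.
REFRESH MATERIALIZED VIEW user_email_page;
```

Paging by rnk instead of LIMIT/OFFSET keeps all rows of a user on the same page, and the ranks stay stable between page requests until the next refresh.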