Get incremental counts of an aggregated value in a joined table

Art*_*ode 10 mysql aggregate mysql-5.7

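I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, so each post has a total aggregated weight.

For each increment of 10 points of weight (i.e. 0, 10, 20, 30, etc.), I want to get the number of posts whose total weight is less than or equal to that increment. I'd expect the results to look something like this: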

 weight | post_count
--------+------------
      0 | 0
     10 | 5
     20 | 12
     30 | 18
    ... | ...
    280 | 20918
    290 | 21102
    ... | ...
   1250 | 118005
   1260 | 118039
   1270 | 118040

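The total weights are approximately normally distributed, with a few very low values and a few very high values (the maximum is currently 1277), but most in the middle. There are just under 120,000 rows in posts and around 120 rows in reasons. Each post has 5 or 6 reasons on average.

The relevant parts of the tables look like this: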

CREATE TABLE `posts` (
  id BIGINT PRIMARY KEY
);

CREATE TABLE `reasons` (
  id BIGINT PRIMARY KEY,
  weight INT(11) NOT NULL
);

CREATE TABLE `posts_reasons` (
  post_id BIGINT NOT NULL,
  reason_id BIGINT NOT NULL,
  CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
  CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);

So far, I've tried putting the post ID and total weight into a view, then joining that view to itself to get an aggregated count:

CREATE VIEW `post_weights` AS (
    SELECT 
        posts.id,
        SUM(reasons.weight) AS reason_weight
    FROM posts
    INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
    INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
    GROUP BY posts.id
);

SELECT
    FLOOR(p1.reason_weight / 10) AS weight,
    COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;

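However, that's unusably slow: I let it run for 15 minutes without it terminating, which I can't do in production.

Is there a more efficient way to do this?

If you're interested in testing against the entire dataset, it can be downloaded here. The file is around 60MB, and it expands to around 250MB. Alternatively, there are 12,000 rows in a GitHub gist here.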

Len*_*art 8

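Using functions or expressions in JOIN conditions is usually a bad idea. I say usually because some optimizers handle it fairly well and can use indexes anyway. I would suggest creating a table for the weights. Something like: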

CREATE TABLE weights
( weight int not null primary key 
);

INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);
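The elided VALUES list does not have to be typed out by hand. As a minimal sketch, assuming the weights table defined above and an upper bound of 1270 (the last value in the list above; adjust it to the data): MySQL 5.7 has no recursive common table expressions, so the multiples of 10 can instead be generated by cross-joining two small inline digit tables. The aliases ones and tens are illustrative names only.

```sql
-- Fill weights with 0, 10, 20, ..., 1270 (128 rows).
-- ones.d covers 0..9 and tens.d covers 0..12, so (tens.d * 10 + ones.d)
-- enumerates 0..129; multiplying by 10 gives candidate bucket boundaries,
-- and the WHERE clause trims everything above the assumed maximum of 1270.
INSERT INTO weights (weight)
SELECT (tens.d * 10 + ones.d) * 10 AS weight
FROM (
       SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
       UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
       UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
     ) AS ones
CROSS JOIN (
       SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
       UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
       UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
       UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12
     ) AS tens
WHERE (tens.d * 10 + ones.d) * 10 <= 1270;
```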

Make sure you have an index on posts_reasons:

CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);

Then use a query like:

SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
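To see how the non-equi join produces a cumulative count, here is a small self-contained sketch with made-up data (three weight increments and four posts, none of them from the real dataset). It uses `>=` rather than `>`, on the assumption that the question's "less than or equal to" wording should be followed literally. With `>` as in the query above, a post whose total is exactly 10 would first be counted at 20.

```sql
-- Hypothetical inline data: w stands in for the weights table,
-- x for the per-post totals computed by the derived table.
SELECT w.weight, COUNT(*) AS post_count
FROM (SELECT 10 AS weight UNION ALL SELECT 20 UNION ALL SELECT 30) AS w
JOIN (SELECT 1 AS post_id, 4 AS sum_weight
      UNION ALL SELECT 2, 10
      UNION ALL SELECT 3, 17
      UNION ALL SELECT 4, 25) AS x
  ON w.weight >= x.sum_weight   -- each post matches every increment at or above its total
GROUP BY w.weight
ORDER BY w.weight;
-- weight 10 -> 2 posts (totals 4, 10)
-- weight 20 -> 3 posts (totals 4, 10, 17)
-- weight 30 -> 4 posts (all four)
```

Because this is an inner join, an increment that no post falls under produces no row at all. If a zero count is wanted for such increments, a LEFT JOIN combined with COUNT(x.post_id) would keep them.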

My machine at home is probably 5-6 years old. It has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8GB of RAM.

uname -a
Linux 垃圾 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I tested against:

https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5

MariaDB [test3]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)


SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;

+--------+------------+
| weight | post_count |
+--------+------------+
|      0 |          1 |
|     10 |       2591 |
|     20 |       4264 |
|     30 |       4386 |
|     40 |       5415 |
|     50 |       7499 |
[...]   
|   1270 |     119283 |
|   1320 |     119286 |
|   1330 |     119286 |
[...]
|   2590 |     119286 |
+--------+------------+
256 rows in set (9.89 sec)

If performance is critical and nothing else helps, you could create a summary table for:

SELECT pr.post_id, SUM(r.weight) as sum_weight     
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id

You can maintain this table via triggers.
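A rough sketch of what that could look like follows. The table name post_weight_summary and the trigger names are hypothetical, and the sketch assumes rows in posts_reasons are only ever inserted or deleted and that reasons.weight does not change. Updates to posts_reasons or to reasons.weight would need additional triggers, or a periodic rebuild of the table.

```sql
-- Hypothetical summary table: one row per post with its total weight.
CREATE TABLE post_weight_summary (
  post_id    BIGINT NOT NULL PRIMARY KEY,
  sum_weight INT    NOT NULL
);

-- Initial load, using the aggregate query shown above.
INSERT INTO post_weight_summary (post_id, sum_weight)
SELECT pr.post_id, SUM(r.weight)
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id;

-- When a reason is attached to a post, add its weight to the post's total
-- (creating the summary row if the post does not have one yet).
CREATE TRIGGER trg_posts_reasons_ai AFTER INSERT ON posts_reasons
FOR EACH ROW
  INSERT INTO post_weight_summary (post_id, sum_weight)
  SELECT NEW.post_id, r.weight
  FROM reasons r
  WHERE r.id = NEW.reason_id
  ON DUPLICATE KEY UPDATE sum_weight = sum_weight + VALUES(sum_weight);

-- When a reason is detached from a post, subtract its weight again.
CREATE TRIGGER trg_posts_reasons_ad AFTER DELETE ON posts_reasons
FOR EACH ROW
  UPDATE post_weight_summary s
  JOIN reasons r ON r.id = OLD.reason_id
  SET s.sum_weight = s.sum_weight - r.weight
  WHERE s.post_id = OLD.post_id;

-- The counting query can then read the summary table directly.
SELECT w.weight
     , COUNT(1) AS post_count
FROM weights w
JOIN post_weight_summary x
    ON w.weight > x.sum_weight
GROUP BY w.weight;
```

One difference from the derived table: a post whose last reason is removed stays in the summary table with a total of 0, whereas the derived table would omit it. Whether that matters depends on how such posts should be counted.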

Since a certain amount of work has to be done for each weight in weights, it may be beneficial to restrict the rows taken from this table:

    ON w.weight > x.sum_weight 
WHERE w.weight <= (select MAX(sum_weights) 
                   from (SELECT SUM(weight) as sum_weights 
                   FROM reasons r        
                   JOIN posts_reasons pr
                       ON r.id = pr.reason_id 
                   GROUP BY pr.post_id) a
                  ) 
GROUP BY w.weight

Since I had a lot of unnecessary rows in my weights table (up to 2590), the restriction above cut the execution time from 9 seconds to 4 seconds.


And*_*y M 7

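In MySQL, variables can be used in queries both to hold values calculated from columns and to appear in the expressions for new, calculated columns. In this case, using a variable results in an efficient query: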

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      (
        SELECT 
          p.id,
          SUM(r.weight) AS reason_weight
        FROM
          posts AS p
          INNER JOIN posts_reasons AS pr ON p.id = pr.post_id
          INNER JOIN reasons AS r ON pr.reason_id = r.id
        GROUP BY
          p.id
      ) AS d
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;

The d derived table is effectively your post_weights view. Therefore, if you plan to keep the view, you can use it in place of the derived table:

SELECT
  weight,
  @cumulative := @cumulative + post_count AS post_count
FROM
  (SELECT @cumulative := 0) AS x,
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS post_count
    FROM
      post_weights
    GROUP BY
      FLOOR(reason_weight / 10)
    ORDER BY
      FLOOR(reason_weight / 10) ASC
  ) AS derived
;

A demo of this solution, using a reduced version of your setup, can be found and played with at SQL Fiddle.
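The variable approach relies on rows being processed in the order of the derived table. The MySQL documentation describes the order of evaluation for expressions involving user variables as undefined, and assigning variables inside a SELECT is deprecated in recent MySQL 8.0 releases. As a sketch only, and not something that will run on the 5.7.22 server from the question: on a server with window functions (MySQL 8.0 or later, or MariaDB 10.2 or later), the same running total can be written without variables. The alias buckets and the column bucket_count are illustrative names, and the post_weights view from the question is assumed to exist.

```sql
-- Requires window function support; NOT available in MySQL 5.7.
SELECT
  weight,
  -- Running total of the per-bucket counts, in ascending bucket order.
  SUM(bucket_count) OVER (ORDER BY weight) AS post_count
FROM
  (
    SELECT
      FLOOR(reason_weight / 10) * 10 AS weight,
      COUNT(*)                       AS bucket_count
    FROM
      post_weights
    GROUP BY
      weight
  ) AS buckets
ORDER BY
  weight
;
```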