Generating a histogram of values grouped by a column

mar*_*oyo 5 postgresql histogram

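I have the following data in a reviews table (named review in the queries below) for a set of items, using a scoring system that ranges from 0 to 100: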

+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1         | 1       | 90    |
+-----------+---------+-------+
| 2         | 1       | 40    |
+-----------+---------+-------+
| 3         | 1       | 10    |
+-----------+---------+-------+
| 4         | 2       | 90    |
+-----------+---------+-------+
| 5         | 2       | 90    |
+-----------+---------+-------+
| 6         | 2       | 70    |
+-----------+---------+-------+
| 7         | 3       | 80    |
+-----------+---------+-------+
| 8         | 3       | 80    |
+-----------+---------+-------+
| 9         | 3       | 80    |
+-----------+---------+-------+
| 10        | 3       | 80    |
+-----------+---------+-------+
| 11        | 4       | 10    |
+-----------+---------+-------+
| 12        | 4       | 30    |
+-----------+---------+-------+
| 13        | 4       | 50    |
+-----------+---------+-------+
| 14        | 4       | 80    |
+-----------+---------+-------+

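I am trying to create a histogram of the score values using 5 bins (each 20 points wide). My goal is to generate one histogram per item. To create a histogram of the entire table, width_bucket can be used. This can also be adapted to operate per item: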

SELECT item_id, g.n AS bucket, COUNT(m.score) AS count
FROM generate_series(1, 5) g(n) LEFT JOIN
     review AS m
     -- 5 equal-width bins over [0, 100); least() folds a score of exactly 100 into bin 5
     ON least(width_bucket(m.score, 0, 100, 5), 5) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;

However, the result looks like this:

+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1       | 1      | 1     |
+---------+--------+-------+
| 1       | 3      | 1     |
+---------+--------+-------+
| 1       | 5      | 1     |
+---------+--------+-------+
| 2       | 4      | 1     |
+---------+--------+-------+
| 2       | 5      | 2     |
+---------+--------+-------+
| 3       | 5      | 4     |
+---------+--------+-------+
| 4       | 1      | 1     |
+---------+--------+-------+
| 4       | 2      | 1     |
+---------+--------+-------+
| 4       | 3      | 1     |
+---------+--------+-------+
| 4       | 5      | 1     |
+---------+--------+-------+

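That is, bins with no entries are not included. While I don't find this a bad solution, I would rather have all the buckets, with 0 for those that have no entries. Even better would be this structure: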

+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1       | 1        | 0        | 1        | 0        | 1        |
+---------+----------+----------+----------+----------+----------+
| 2       | 0        | 0        | 0        | 1        | 2        |
+---------+----------+----------+----------+----------+----------+
| 3       | 0        | 0        | 0        | 0        | 4        |
+---------+----------+----------+----------+----------+----------+
| 4       | 1        | 1        | 1        | 0        | 1        |
+---------+----------+----------+----------+----------+----------+

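I prefer this solution because it uses one row per item (instead of 5n rows for n items), which makes it simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows: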

select item_id,
       sum(case when score >= 0 and score <= 19 then 1 else 0 end) as bucket_1,
       sum(case when score >= 20 and score <= 39 then 1 else 0 end) as bucket_2,
       sum(case when score >= 40 and score <= 59 then 1 else 0 end) as bucket_3,
       sum(case when score >= 60 and score <= 79 then 1 else 0 end) as bucket_4,
       sum(case when score >= 80 and score <= 100 then 1 else 0 end) as bucket_5
from review
group by item_id;  -- required because item_id is selected alongside aggregates

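Although this query does what I need, I would like to know whether there is a more elegant approach. So many CASE expressions are hard to read, and changing the bin criteria may require updating every sum. I am also curious about any performance problems this query might have.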

小智 4

The second query can be rewritten to use ranges, which makes it somewhat easier to write and edit:

with buckets (b1, b2, b3, b4, b5) as (
  values ( 
     -- upper bounds are exclusive, so the last range ends at 101 to include a score of 100
     int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 101)
  )
)
select item_id, 
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review 
  cross join buckets
group by item_id
order by item_id;

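A range constructed with int4range(0, 20) includes its lower bound and excludes its upper bound, so a score of 20 belongs to the second bucket, not the first.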
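
The bound behaviour can be checked directly. The following is a small illustrative query, not part of the original answer; the column aliases are made up for the example, and the three-argument form of int4range is used only to show how an inclusive upper bound could be requested instead:

-- Default int4range bounds are '[)': lower bound inclusive, upper bound exclusive.
select int4range(0, 20) @> 0            as contains_lower,   -- true
       int4range(0, 20) @> 19           as contains_19,      -- true
       int4range(0, 20) @> 20           as contains_upper,   -- false
       int4range(80, 100, '[]') @> 100  as closed_upper;     -- true: '[]' makes both bounds inclusive

If scores of exactly 100 can occur, the last bucket therefore needs either an upper bound of 101 or explicit '[]' bounds.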

The CTE named buckets produces exactly one row, so the cross join does not change the number of rows coming from the review table.
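
If the bins are always equal in width, a possible variation is to let width_bucket compute the bucket number once per row and then pivot on that number, so that changing the bin layout means editing one expression rather than five. This is only a sketch under that assumption; the CTE name bucketed and its bucket column are made up for the example, and least(..., 5) is added so that a score of exactly 100, which width_bucket would otherwise place in overflow bucket 6, is counted in the last bucket:

with bucketed as (
  -- compute the bucket number (1..5) once per review
  select item_id,
         least(width_bucket(score, 0, 100, 5), 5) as bucket
  from review
)
select item_id,
       count(*) filter (where bucket = 1) as bucket_1,
       count(*) filter (where bucket = 2) as bucket_2,
       count(*) filter (where bucket = 3) as bucket_3,
       count(*) filter (where bucket = 4) as bucket_4,
       count(*) filter (where bucket = 5) as bucket_5
from bucketed
group by item_id
order by item_id;

Like the range version, this reads the review table once and returns one row per item, with 0 in the buckets that have no entries.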