Pra*_*sas 9 sql database postgresql performance amazon-redshift
Question:
Suppose there is a simple (but large) table, foods:
id name
-- -----------
01 ginger beer
02 white wine
03 red wine
04 ginger wine
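I want to count how many entries match certain hard-coded patterns, for example names containing the word 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%'), or anything else, and write those counts into rows next to a comment. The result I am looking for is: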
comment total
--------------- -----
contains ginger 2
for wine lovers 3
Solution 1 (correct format but inefficient):
One can use UNION ALL and construct the following:
SELECT * FROM
(
    (
        SELECT
            'contains ginger' AS comment,
            sum((name LIKE '%ginger%')::INT) AS total
        FROM foods
    )
    UNION ALL
    (
        SELECT
            'for wine lovers' AS comment,
            sum((name LIKE '%wine%')::INT) AS total
        FROM foods
    )
) AS counts
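Clearly this behaves much like running several separate queries and stitching the results together afterwards, so foods is scanned once per pattern. That is very inefficient.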
Solution 2 (efficient but wrong format):
The following is many times faster than the previous solution:
SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods
The result is:
contains_ginger for_wine_lovers
--------------- ---------------
2 3
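So the same information can clearly be obtained much faster, but in the wrong format...
Discussion:
What is the best overall approach? What should I do to get the result I want both efficiently and in the preferred format? Or is that really not possible?
By the way, I am writing this for Redshift (which is based on PostgreSQL).
Thanks.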
Option 1: Reshape manually
CREATE TEMPORARY TABLE wide AS (
    SELECT
        sum((name LIKE '%ginger%')::INT) AS contains_ginger,
        sum((name LIKE '%wine%')::INT) AS for_wine_lovers
        ...
    FROM foods
);
-- turn each column of the one-row table into its own labelled row
SELECT
    'contains ginger', contains_ginger FROM wide
UNION ALL
SELECT
    'for wine lovers', for_wine_lovers FROM wide
UNION ALL
...;
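If a temporary table is not wanted, the same reshaping can probably be done in one statement. The following is only a sketch under a few assumptions: the sample rows are the four from the question, `labels` is a hypothetical helper CTE introduced here, and the boolean-to-INT cast is taken to behave as in the question's own queries.

```sql
-- Sample data from the question (ids stored as integers for simplicity).
CREATE TEMPORARY TABLE foods (id INT, name VARCHAR(50));
INSERT INTO foods VALUES
    (1, 'ginger beer'),
    (2, 'white wine'),
    (3, 'red wine'),
    (4, 'ginger wine');

WITH wide AS (
    -- single scan of foods, one column per pattern
    SELECT
        sum((name LIKE '%ginger%')::INT) AS contains_ginger,
        sum((name LIKE '%wine%')::INT)   AS for_wine_lovers
    FROM foods
),
labels AS (
    -- hypothetical helper: one row per output line
    SELECT 'contains ginger' AS comment
    UNION ALL
    SELECT 'for wine lovers'
)
SELECT
    labels.comment,
    -- pick the matching column of the one-row aggregate for each label
    CASE labels.comment
        WHEN 'contains ginger' THEN wide.contains_ginger
        WHEN 'for wine lovers' THEN wide.for_wine_lovers
    END AS total
FROM wide
CROSS JOIN labels
ORDER BY labels.comment;
-- comment          total
-- contains ginger  2
-- for wine lovers  3
```

Because `wide` has exactly one row, the cross join only multiplies it by the number of labels, so the extra cost over Solution 2 should be negligible.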
Option 2: Create a category table and use a join
-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    union all
    SELECT 'for wine lovers', 'wine'
    ...
)
SELECT
    -- COUNT(foods.id) instead of COUNT(*): a category with no matching food then yields 0, not 1
    categories.category_label, COUNT(foods.id)
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1
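Since you consider Solution 2 fast enough, Option 1 should work for you.
Option 2 should also be reasonably efficient, and it is easier to write and extend. As an added bonus, the LEFT JOIN keeps categories that match no food at all, so the query can tell you when a category is empty (provided the count is taken over a foods column, since COUNT(*) would also count the NULL-padded row).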
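To make the empty-category behaviour concrete, here is a small self-contained sketch. The 'for tea drinkers' category is hypothetical and added only to show a zero count; the sample rows are the ones from the question.

```sql
-- Sample data from the question (ids stored as integers for simplicity).
CREATE TEMPORARY TABLE foods (id INT, name VARCHAR(50));
INSERT INTO foods VALUES
    (1, 'ginger beer'),
    (2, 'white wine'),
    (3, 'red wine'),
    (4, 'ginger wine');

WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    UNION ALL
    SELECT 'for wine lovers', 'wine'
    UNION ALL
    SELECT 'for tea drinkers', 'tea'   -- hypothetical category with no matching food
)
SELECT
    categories.category_label AS comment,
    COUNT(foods.id)           AS total  -- NULLs from the LEFT JOIN are not counted
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY categories.category_label
ORDER BY categories.category_label;
-- comment           total
-- contains ginger   2
-- for tea drinkers  0
-- for wine lovers   3
```

With COUNT(*) in place of COUNT(foods.id), the 'for tea drinkers' row would report 1, because the LEFT JOIN still emits one row for the unmatched category.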
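Option 3: Reshape and redistribute the data to better match the grouping keys.
If query execution time really matters, you can also preprocess the dataset. How much this helps depends heavily on your data volume and data distribution. Do you have only a few hard-coded categories, or will they be searched for dynamically from some kind of interface?
For example:
If the dataset is reshaped like this: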
content id
-------- --
ginger 01
ginger 04
beer 01
white 02
wine 02
wine 04
wine 03
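You can then shard and distribute on content, and each instance can perform its part of the aggregation in parallel.
Here, an equivalent query might look like this: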
WITH content_count AS (
    SELECT content, COUNT(*) AS total
    FROM reshaped_food_table
    GROUP BY 1
)
SELECT
    CASE content
        WHEN 'ginger' THEN 'contains ginger'
        WHEN 'wine' THEN 'for wine lovers'
        ELSE 'other'
    END AS category
    , total
FROM content_count
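The reshaped table that Option 3 relies on is not defined above. One way it could be built is sketched below, assuming names are space-separated words with at most three words each; the inline list of positions and the column name `id` for the food identifier are assumptions made for this illustration. `split_part` returns an empty string when the requested position is past the last word, which the WHERE clause filters out.

```sql
-- Sample data from the question (ids stored as integers for simplicity).
CREATE TEMPORARY TABLE foods (id INT, name VARCHAR(50));
INSERT INTO foods VALUES
    (1, 'ginger beer'),
    (2, 'white wine'),
    (3, 'red wine'),
    (4, 'ginger wine');

-- One row per (word, food id). On Redshift a distribution key on content
-- could presumably be declared on this table; that part is left out here.
CREATE TEMPORARY TABLE reshaped_food_table AS
SELECT
    split_part(f.name, ' ', p.n) AS content,
    f.id
FROM foods f
CROSS JOIN (
    -- assumed maximum of three words per name
    SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3
) p
WHERE split_part(f.name, ' ', p.n) <> '';

SELECT content, COUNT(*) AS total
FROM reshaped_food_table
GROUP BY content
ORDER BY content;
-- content  total
-- beer     1
-- ginger   2
-- red      1
-- white    1
-- wine     3
```

Note that this counts whole words, so it is not strictly equivalent to LIKE '%wine%': a name such as 'winery' would fall under its own content value rather than under 'wine'.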