Z. *_* M. 6 sql postgresql query-optimization
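I have a table with hundreds of millions of rows, and I want to get a single list of the unique values from 2 indexed columns of that same table (there is no unique row ID).
To illustrate, say we have a table with a fruits column and a veggies column, and I want to build a healthy_foods list containing the unique values from both columns.
I have tried the following queries:
With UNION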
WITH cte as (
SELECT fruit, veggie
FROM recipes
)
SELECT fruit as healthy_food
FROM cte
UNION -- <---
SELECT veggie as healthy_food
FROM cte;
With UNION ALL, then DISTINCT ON
WITH cte as (...)
SELECT DISTINCT ON (healthy_food) healthy_food FROM -- <---
(SELECT fruit as healthy_food
FROM cte
UNION ALL -- <---
SELECT veggie as healthy_food
FROM cte) tb;
With UNION ALL, then GROUP BY
WITH cte as (...)
SELECT fruit as healthy_food
FROM cte
UNION ALL -- <---
SELECT veggie as healthy_food
FROM cte
GROUP BY healthy_food; -- <---
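(I also tried adding HAVING COUNT(*) = 1 and GROUP BY to each SELECT of the UNION.)
UNION ALL on its own executes very quickly, but every duplicate-removal combination I have tried takes more than 15 minutes.
Given that the 2 fields/columns come from the same table and are indexed, how can I optimize this query?
(Alternatively, what would be the cheapest way to keep track of all unique values? Maybe an insert trigger on a UNIQUE table or a view?)
If there are many duplicates within fruits and/or within veggies, but not so many between fruits and veggies (as the names in the example suggest), and since you have an index on each of them, emulating an index skip scan (a.k.a. loose index scan) will work wonders: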
WITH RECURSIVE fruit AS (
(
SELECT fruit
FROM recipes
ORDER BY 1
LIMIT 1
)
UNION ALL
SELECT (SELECT fruit
FROM recipes
WHERE fruit > t.fruit
ORDER BY 1
LIMIT 1)
FROM fruit t
WHERE t.fruit IS NOT NULL
)
, veggie AS (
(
SELECT veggie
FROM recipes
ORDER BY 1
LIMIT 1
)
UNION ALL
SELECT (SELECT veggie
FROM recipes
WHERE veggie > t.veggie
ORDER BY 1
LIMIT 1)
FROM veggie t
WHERE t.veggie IS NOT NULL
)
SELECT DISTINCT healthy_food
FROM (
SELECT fruit AS healthy_food FROM fruit
UNION ALL
SELECT veggie AS healthy_food FROM veggie
) sub
WHERE healthy_food IS NOT NULL;
The outer SELECT uses plain DISTINCT instead of DISTINCT ON (which you tried), since we are dealing with a single column.
You might as well use UNION instead of UNION ALL + DISTINCT in the outer SELECT. I only avoided that because you explicitly asked for it, but I don't see the point.
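As a rough sketch of that UNION alternative: the same two recursive CTEs can feed a plain UNION, which removes duplicates across the two lists by itself. This assumes the recipes table from the question and a B-tree index on each of fruit and veggie. It has not been timed against the original data.

```sql
WITH RECURSIVE fruit AS (
   (  -- smallest fruit, fetched from the index
   SELECT fruit
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   -- next distinct fruit after the previous one; yields NULL when exhausted
   SELECT (SELECT fruit
           FROM   recipes
           WHERE  fruit > t.fruit
           ORDER  BY 1
           LIMIT  1)
   FROM   fruit t
   WHERE  t.fruit IS NOT NULL
   )
, veggie AS (
   (  -- smallest veggie, fetched from the index
   SELECT veggie
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   -- next distinct veggie after the previous one; yields NULL when exhausted
   SELECT (SELECT veggie
           FROM   recipes
           WHERE  veggie > t.veggie
           ORDER  BY 1
           LIMIT  1)
   FROM   veggie t
   WHERE  t.veggie IS NOT NULL
   )
SELECT fruit AS healthy_food
FROM   fruit
WHERE  fruit IS NOT NULL   -- drop the terminating NULL row
UNION                      -- deduplicates values present in both lists
SELECT veggie
FROM   veggie
WHERE  veggie IS NOT NULL;
```

Both forms should return the same set of values. By the time the outer query runs, each CTE already holds only distinct values, so the final deduplication step works on a small input either way.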