Fastest way to remove duplicates from UNION ALL (without using UNION) on PostgreSQL?

Z. *_* M. 6 sql postgresql query-optimization

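I have a table with hundreds of millions of rows, and I want to get a single list of the unique values from 2 indexed columns of that same table (there is no unique row ID).

To illustrate, say we have a table recipes with a fruit column and a veggie column, and I want to build a healthy_food list containing the unique values from both columns.

I have tried the following queries:

With UNION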

WITH cte AS (
    SELECT fruit, veggie
    FROM recipes
)
SELECT fruit AS healthy_food
FROM cte
UNION --  <---
SELECT veggie AS healthy_food
FROM cte;

With UNION ALL, then DISTINCT ON

WITH cte AS (...)
SELECT DISTINCT ON (healthy_food) healthy_food --  <---
FROM (
    SELECT fruit AS healthy_food
    FROM cte
    UNION ALL --  <---
    SELECT veggie AS healthy_food
    FROM cte
) tb;

With UNION ALL, then GROUP BY

WITH cte AS (...)
SELECT fruit AS healthy_food
FROM cte
UNION ALL --  <---
SELECT veggie AS healthy_food
FROM cte
GROUP BY healthy_food; --  <---

(And also with HAVING COUNT(*) = 1 and GROUP BY added to each SELECT of the UNION.)
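
Editorial note on the third attempt: in PostgreSQL, a GROUP BY written after the last SELECT of a UNION ALL belongs to that SELECT only, so as written it groups the veggie branch and leaves the fruit branch untouched. A minimal sketch of a form that groups the combined rows, reusing the CTE from the first attempt (the subquery alias tb is borrowed from the second attempt); no claim is made here about how fast it runs on the table in question:

```sql
-- Sketch only: GROUP BY applied to the whole UNION ALL via a subquery.
WITH cte AS (
    SELECT fruit, veggie
    FROM recipes
)
SELECT healthy_food
FROM (
    SELECT fruit AS healthy_food FROM cte
    UNION ALL
    SELECT veggie AS healthy_food FROM cte
) tb
GROUP BY healthy_food;  -- groups all rows from both branches
```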

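UNION ALL executes very fast, but every duplicate-removal combination I have tried takes 15+ minutes.

Given that the 2 fields/columns come from the same table and are both indexed, how can I optimize this query?

(Alternatively, what would be the cheapest way to keep track of all unique values? Maybe a trigger on insert into a UNIQUE table or a view?)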

Erw*_*ter 4

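Assuming there are many duplicates among fruits and/or among veggies, but not so many between fruits and veggies (as the names in the example suggest), and since you have an index on each column, emulating an index skip scan (a.k.a. loose index scan) works wonders: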

WITH RECURSIVE fruit AS (
   (
   SELECT fruit
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT fruit
           FROM   recipes
           WHERE  fruit > t.fruit
           ORDER  BY 1
           LIMIT  1)
   FROM   fruit t
   WHERE  t.fruit IS NOT NULL
   )
 , veggie AS (
   (
   SELECT veggie
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT veggie
           FROM   recipes
           WHERE  veggie > t.veggie
           ORDER  BY 1
           LIMIT  1)
   FROM   veggie t
   WHERE  t.veggie IS NOT NULL
   )
SELECT DISTINCT healthy_food
FROM  (
   SELECT fruit AS healthy_food FROM fruit
   UNION  ALL
   SELECT veggie AS healthy_food FROM veggie
   ) sub
WHERE  healthy_food IS NOT NULL;

Just DISTINCT instead of DISTINCT ON (like you tried) in the outer SELECT, since we are dealing with a single column.
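
Each recursion step in the query above (WHERE fruit > t.fruit ORDER BY 1 LIMIT 1) is only cheap if it can be answered by a single index probe. A minimal sketch of what that assumes and how one might check it: the index names are hypothetical (the question only says both columns are indexed), and 'apple' is a made-up sample value that assumes a text-like column.

```sql
-- Hypothetical index names; skip these if equivalent indexes already exist.
CREATE INDEX IF NOT EXISTS recipes_fruit_idx  ON recipes (fruit);
CREATE INDEX IF NOT EXISTS recipes_veggie_idx ON recipes (veggie);

-- One step of the emulated skip scan: the smallest fruit greater than a given value.
-- With a usable index, the plan should show an index scan (or index-only scan)
-- under a Limit node rather than a sequential scan plus sort.
EXPLAIN
SELECT fruit
FROM   recipes
WHERE  fruit > 'apple'   -- made-up sample value
ORDER  BY 1
LIMIT  1;
```

The recursive query then performs roughly one such probe per distinct value, which is why it pays off when distinct values are few compared to the number of rows.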

You might as well use UNION instead of UNION ALL + DISTINCT in the outer SELECT. I only avoided it because you explicitly asked for that. But I don't see the point.
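
For illustration, a sketch of that UNION variant. The two recursive CTEs are unchanged from the query above; the aliases f and v and the per-branch NULL filters are additions made here so that the trailing NULL produced by each recursion is still dropped:

```sql
WITH RECURSIVE fruit AS (
   (
   SELECT fruit
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT fruit
           FROM   recipes
           WHERE  fruit > t.fruit
           ORDER  BY 1
           LIMIT  1)
   FROM   fruit t
   WHERE  t.fruit IS NOT NULL
   )
 , veggie AS (
   (
   SELECT veggie
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT veggie
           FROM   recipes
           WHERE  veggie > t.veggie
           ORDER  BY 1
           LIMIT  1)
   FROM   veggie t
   WHERE  t.veggie IS NOT NULL
   )
-- UNION removes duplicates across both branches, so no outer DISTINCT is needed.
SELECT f.fruit AS healthy_food
FROM   fruit f
WHERE  f.fruit IS NOT NULL    -- drop the NULL that ends the recursion
UNION
SELECT v.veggie
FROM   veggie v
WHERE  v.veggie IS NOT NULL;
```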

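  • This blew my mind! `UNION` alone took 16 minutes, the `GROUP BY` variant took <6 minutes, but this one finished in... **33 seconds!** Hats off to you (I will dig deeper into this topic, since your reply has piqued my curiosity) (3 upvotes)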