为什么 OR 语句比 UNION 语句慢？

Question

为什么 OR 语句比 UNION 语句慢？

Moh*_*ein 11 postgresql performance postgresql-12

数据库版本：PostgreSQL 12.6

我有一张有 600,000 条记录的表。

该表具有以下列：

名称 (varchar)
location_type (int) 枚举值：(1,2,3)
血统（varchar）

索引：

血统（btree）

祖先列是一种构建树的方法，其中每一行都有一个祖先，其中包含所有以“/”分隔的父 ID。

考虑以下示例：

ID	姓名	祖先
1	根	空值
5	节点	'1'
12	节点	'1/5'
22	叶子	'1/5/12'

以下查询需要 686 毫秒来执行：

SELECT * FROM geolocations
WHERE EXISTS (
   SELECT 1 FROM geolocations g2
   WHERE g2.ancestry =
      CONCAT(geolocations.ancestry, '/', geolocations.id)
)

Run Code Online (Sandbox Code Playgroud)

此查询在 808 毫秒内运行：

SELECT * FROM geolocations
WHERE location_type = 2

Run Code Online (Sandbox Code Playgroud)

将两个查询与 OR 结合使用时，如果它完成，则需要大约 4475 毫秒才能完成。

SELECT * FROM geolocations
WHERE EXISTS (
   SELECT 1 FROM geolocations g2
   WHERE g2.ancestry =
      CONCAT(geolocations.ancestry, '/', geolocations.id)
) OR location_type = 2

Run Code Online (Sandbox Code Playgroud)

解释：

[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "geolocations",
      "Alias": "geolocations",
      "Startup Cost": 0,
      "Total Cost": 2760473.54,
      "Plan Rows": 582910,
      "Plan Width": 68,
      "Filter": "((SubPlan 1) OR (location_type = 2))",
      "Plans": [
        {
          "Node Type": "Index Only Scan",
          "Parent Relationship": "SubPlan",
          "Subplan Name": "SubPlan 1",
          "Parallel Aware": false,
          "Scan Direction": "Forward",
          "Index Name": "index_geolocations_on_ancestry",
          "Relation Name": "geolocations",
          "Alias": "g2",
          "Startup Cost": 0.43,
          "Total Cost": 124.91,
          "Plan Rows": 30,
          "Plan Width": 0,
          "Index Cond": "(ancestry = concat(geolocations.ancestry, '/', geolocations.id))"
        }
      ]
    },
    "JIT": {
      "Worker Number": -1,
      "Functions": 8,
      "Options": {
        "Inlining": true,
        "Optimization": true,
        "Expressions": true,
        "Deforming": true
      }
    }
  }
]

Run Code Online (Sandbox Code Playgroud)

将它们与联合结合需要 1916 毫秒：

SELECT * FROM geolocations
WHERE EXISTS (
   SELECT 1 FROM geolocations g2
   WHERE g2.ancestry =
      CONCAT(geolocations.ancestry, '/', geolocations.id)
) UNION SELECT * FROM geolocations WHERE location_type = 2

Run Code Online (Sandbox Code Playgroud)

解释

[
  {
    "Plan": {
      "Node Type": "Unique",
      "Parallel Aware": false,
      "Startup Cost": 308693.44,
      "Total Cost": 332506.74,
      "Plan Rows": 865938,
      "Plan Width": 188,
      "Plans": [
        {
          "Node Type": "Sort",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Startup Cost": 308693.44,
          "Total Cost": 310858.29,
          "Plan Rows": 865938,
          "Plan Width": 188,
          "Sort Key": [
            "geolocations.id",
            "geolocations.name",
            "geolocations.location_type",
            "geolocations.pricing",
            "geolocations.ancestry",
            "geolocations.geolocationable_id",
            "geolocations.geolocationable_type",
            "geolocations.created_at",
            "geolocations.updated_at",
            "geolocations.info"
          ],
          "Plans": [
            {
              "Node Type": "Append",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Startup Cost": 15851.41,
              "Total Cost": 63464.05,
              "Plan Rows": 865938,
              "Plan Width": 188,
              "Subplans Removed": 0,
              "Plans": [
                {
                  "Node Type": "Hash Join",
                  "Parent Relationship": "Member",
                  "Parallel Aware": false,
                  "Join Type": "Inner",
                  "Startup Cost": 15851.41,
                  "Total Cost": 35074.94,
                  "Plan Rows": 299882,
                  "Plan Width": 68,
                  "Inner Unique": true,
                  "Hash Cond": "(concat(geolocations.ancestry, '/', geolocations.id) = (g2.ancestry)::text)",
                  "Plans": [
                    {
                      "Node Type": "Seq Scan",
                      "Parent Relationship": "Outer",
                      "Parallel Aware": false,
                      "Relation Name": "geolocations",
                      "Alias": "geolocations",
                      "Startup Cost": 0,
                      "Total Cost": 13900.63,
                      "Plan Rows": 599763,
                      "Plan Width": 68
                    },
                    {
                      "Node Type": "Hash",
                      "Parent Relationship": "Inner",
                      "Parallel Aware": false,
                      "Startup Cost": 15600.65,
                      "Total Cost": 15600.65,
                      "Plan Rows": 20061,
                      "Plan Width": 12,
                      "Plans": [
                        {
                          "Node Type": "Aggregate",
                          "Strategy": "Hashed",
                          "Partial Mode": "Simple",
                          "Parent Relationship": "Outer",
                          "Parallel Aware": false,
                          "Startup Cost": 15400.04,
                          "Total Cost": 15600.65,
                          "Plan Rows": 20061,
                          "Plan Width": 12,
                          "Group Key": [
                            "(g2.ancestry)::text"
                          ],
                          "Plans": [
                            {
                              "Node Type": "Seq Scan",
                              "Parent Relationship": "Outer",
                              "Parallel Aware": false,
                              "Relation Name": "geolocations",
                              "Alias": "g2",
                              "Startup Cost": 0,
                              "Total Cost": 13900.63,
                              "Plan Rows": 599763,
                              "Plan Width": 12
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "Node Type": "Seq Scan",
                  "Parent Relationship": "Member",
                  "Parallel Aware": false,
                  "Relation Name": "geolocations",
                  "Alias": "geolocations_1",
                  "Startup Cost": 0,
                  "Total Cost": 15400.04,
                  "Plan Rows": 566056,
                  "Plan Width": 68,
                  "Filter": "(location_type = 2)"
                }
              ]
            }
          ]
        }
      ]
    },
    "JIT": {
      "Worker Number": -1,
      "Functions": 15,
      "Options": {
        "Inlining": false,
        "Optimization": false,
        "Expressions": true,
        "Deforming": true
      }
    }
  }
]

Run Code Online (Sandbox Code Playgroud)

为什么 PostgreSQL 执行 OR 查询的速度要慢得多？

Answer 1

Erw*_*ter 20

杰夫已经暗示过这一点，但我觉得有必要指出房间里的大象：
这两个查询是不等价的！

UNION 删除SELECT列表中的所有重复项。
而另一个查询与OR保留它们。

您有SELECT * FROM geolocations，并且FROM列表中没有其他表。因此，如果表中没有重复的行（这由包括和约束的列UNIQUE上的任何索引保证），则结果中不能有重复，这两个查询毕竟是等价的。但是任何或任何具有（非唯一）列子集的列表都会在结果中引入重复的行！NOT NULLPRIMARY KEYUNIQUEJOINSELECT

使用UNION ALL代替甚至更远。它会产生OR-ed 谓词不会的重复项。如果同一行符合多个OR-ed 谓词，则它符合一次。重写UNION ALL将多次选择同一行。

没有办法过滤“坏”重复项并用UNION/保留“好”重复项UNION ALL。所以 Postgres 通常不能OR用UNION计划代替“丑陋”的情况。即使在可能的情况下，也不确定UNION实际上会更快。

但是 Postgres 通常可以OR在位图索引扫描中组合同一个表上的多个-ed 谓词。看：

什么是“位图索引”？

一个“丑OR”是其中不同的基础表谓词OR-ed在一起。（或者甚至是同一个表，但像您的情况一样，但是不同的实例。）即使不涉及索引，这也会使查询更加昂贵。但是，当有效的索引扫描被此阻止时，它会变得特别昂贵。（不同表上的索引不能在位图索引扫描中组合！）通常对于返回很小百分比的底层大表的选择性查询最重要。当无论如何必须读取超过百分之几的行时，索引扫描就会失去作用。Ugly ORs 在这些查询中没有那么严重。

Laurenz Albe 的相关博文：

https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/

解决方案

首先，你的使用CONCAT()是不正确的。它将后代与/根节点的前导连接起来ancestry IS NULL。喜欢'/1'而不是'1'. CONCAT_WS()会是正确的。喜欢：

SELECT *
FROM   geolocations g
WHERE  EXISTS (
   SELECT FROM geolocations g2
   WHERE  g2.ancestry = CONCAT_WS('/', g.ancestry, g.id)  -- !
   )
    OR location_type = 2;

Run Code Online (Sandbox Code Playgroud)

看：

如何连接 Postgres SELECT 中的列？

还是丑。如果你经常运行这样的查询，你可能会做更多的事情。如果表是只读的，请添加一个boolean名为的标志has_children。否则，请考虑MATERIALIZED VIEW使用该额外列或使用触发器使表列保持最新。那么您的查询可以是：

SELECT *
FROM   geolocations
WHERE  has_children
    OR location_type = 2;

Run Code Online (Sandbox Code Playgroud)

has_children通常不是选择性的，因此查询会产生大量结果行并且永远不会非常便宜（尽管便宜很多）。索引该列无济于事。我们需要完整的信息才能找到不同的角度。这超出了这个问题的范围。

无论哪种方式，如果您需要ancestry每一行都具有冗余，请考虑一个带有 GIN 索引的适当数组，而不是丑陋的字符串。或者可能是附加模块ltree，它是旧的，但正是为此目的。

Answer 2

Cha*_*ace 12

PostgreSQL 和许多其他 RDBMS 经常与OR谓词斗争。

经常发生的情况，并且在这种情况下已经发生，编译器决定它无法OR通过单次查找实现这两个条件，而是扫描整个索引，评估每一行的两个（或更多）条件。

尽管索引联合的方法更明显（对人类而言）。

你正在做的是一个非常常见的技巧来帮助编译器并强制索引联合。它现在完全独立地评估两侧，在这种情况下它要快得多。

它可能并不总是更快，例如，如果location_type = 2它占表的很大一部分。当两个条件在性能上有很大差异时，好处更明显。

例如，WHERE id = @id OR someName = @name第一个条件是在单行上直接搜索，而第二个条件是对几行的搜索。编译器无法通过单次查找来满足这一点，因此它经常跳转到扫描整个表。索引联合在这里有所帮助，因为您可以利用一个索引id和另一个索引someName

Answer 3

jja*_*nes 6

只要设置足够大（超过大约 20MB），EXISTS无论有没有查询，我的查询速度基本相同。OR location_type = 2work_mem

对于没有 OR 子句的 EXISTS，我得到一个哈希连接。使用 OR 子句，我得到了 Hashed SubPlan。问题是，虽然 Hash Join 知道如何通过以有效的方式溢出到磁盘来处理低 work_mem，但 Hashed SubPlan 不知道。如果 work_mem 太低，它只会更改为较慢的未散列的 SubPlan。

work_mem 的默认设置对于运行此类查询的服务器来说太小了。

自动转换为 UNION 或 UNION ALL 很棘手，因为很难证明这给出了与 OR 相同的答案。决定不关心是否删除了一些明显的重复项是您的特权，但不是数据库为您做出该决定的特权。

归档时间：	4 年，5 月前
查看次数：	3097 次
最近记录：	4 年，3 月前