How can an index be used with a "NOT IN" clause in PostgreSQL?

Str*_*667 5 postgresql performance index

Schema

CREATE TABLE traffic_hit (
    id            SERIAL                   NOT NULL PRIMARY KEY,
    country       VARCHAR(2)               NOT NULL,
    created       TIMESTAMP WITH TIME ZONE NOT NULL,
    "unique"      BOOLEAN                  NOT NULL,
    user_agent_id INTEGER                  NULL
);
CREATE TABLE utils_useragent (
    id                SERIAL      NOT NULL PRIMARY KEY,
    user_agent_string TEXT        NOT NULL UNIQUE,
    is_robot          BOOLEAN     NOT NULL
);

Initial query

SELECT
  traffic_hit.created::DATE AS group_by,
  COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
  COUNT(*) AS non_unique_visits
FROM
  traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
  traffic_hit.created >= '2016-01-01' AND
  traffic_hit.created < '2017-01-01' AND
  traffic_hit.country = 'CZ' AND
  utils_useragent.is_robot = FALSE
GROUP BY 1

Indexes

CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX new_idx ON traffic_hit (created, country, user_agent_id, "unique");
CREATE INDEX robots ON utils_useragent (id) WHERE is_robot = TRUE;

Query plan

HashAggregate  (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
  Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
  Group Key: (traffic_hit.created)::date
  ->  Hash Join  (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
        Output: (traffic_hit.created)::date, traffic_hit."unique"
        Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
        ->  Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit  (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
              Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_hit."unique"
              Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
              Heap Fetches: 40448
        ->  Hash  (cost=10806.55..10806.55 rows=393991 width=4) (actual time=77.531..77.531 rows=393896 loops=1)
              Output: utils_useragent.id
              Buckets: 524288  Batches: 1  Memory Usage: 17944kB
              ->  Index Only Scan using utils_useragent_id_idx on public.utils_useragent  (cost=0.42..10806.55 rows=393991 width=4) (actual time=0.071..29.285 rows=393896 loops=1)
                    Output: utils_useragent.id
                    Heap Fetches: 5932
Planning time: 0.918 ms
Execution time: 2531.195 ms

Data

The utils_useragent table holds about 4,000 records with is_robot = TRUE and about 395,000 records with is_robot = FALSE. The traffic_hit table contains about 12 million records for 2016.

Goal

Improve read performance, because the query backs a reporting application and its speed matters to users.

My approach

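Since there are very few "robots" in the utils_useragent table, using a partial index should be faster. The other thing I wanted to use is a multicolumn index, so that the query can be served by an index-only scan.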

SELECT
  traffic_hit.created::DATE AS group_by,
  COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
  COUNT(*) AS non_unique_visits
FROM
  traffic_hit
WHERE
  traffic_hit.created >= '2016-01-01' AND
  traffic_hit.created < '2017-01-01' AND
  traffic_hit.country = 'CZ' AND
  user_agent_id NOT IN (select id from utils_useragent where is_robot = TRUE)
GROUP BY 1

New query plan:

HashAggregate  (cost=486612.46..503842.68 rows=1378418 width=20) (actual time=2281.282..2282.627 rows=366 loops=1)
  Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit."unique"), count(*)
  Group Key: (traffic_hit.created)::date
  ->  Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit  (cost=275.23..467834.60 rows=2503714 width=5) (actual time=2.223..1922.960 rows=2430245 loops=1)
        Output: (traffic_hit.created)::date, traffic_hit."unique"
        Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
        Filter: (NOT (hashed SubPlan 1))
        Rows Removed by Filter: 2188625
        Heap Fetches: 40448
        SubPlan 1
          ->  Index Only Scan using only_robots on public.utils_useragent  (cost=0.28..265.32 rows=3739 width=4) (actual time=0.031..0.682 rows=3763 loops=1)
                Output: utils_useragent.id
                Heap Fetches: 0
Planning time: 0.510 ms
Execution time: 2297.849 ms

The new query is slightly faster (about 2.3 s versus 2.5 s), but the plan contains a Filter: (NOT (hashed SubPlan 1)) step, which confuses me.

Question

Why isn't the index used to filter on user_agent_id? Is it possible to use it to improve query performance? Or would some other approach work better?

PostgreSQL version: 9.6.3

jja*_*nes 4

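It does use the index. It scans the index to build a hash table, and then probes that hash table in the filter. Probing a private, in-memory hash table is faster than probing a shared, on-disk index.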

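But if performance matters, why repeatedly aggregate millions of rows of data that have not changed in 6 months? Aggregate once and store the results. You can do this with a materialized view, or just do it by hand.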
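As a rough sketch of the materialized-view option, the daily totals could be stored once and read back by the report. The names traffic_hit_daily and traffic_hit_daily_country_day are hypothetical, and this assumes it is acceptable for the report to be as stale as the last refresh. Note that created::date is evaluated with the time zone in effect when the view is built or refreshed, so the day boundaries are fixed at that point.

```sql
-- Hypothetical view name; one row per (day, country), robots excluded.
CREATE MATERIALIZED VIEW traffic_hit_daily AS
SELECT
  h.created::date                    AS day,
  h.country,
  COUNT(*) FILTER (WHERE h."unique") AS unique_visits,
  COUNT(*)                           AS non_unique_visits
FROM traffic_hit AS h
WHERE h.user_agent_id NOT IN (SELECT id FROM utils_useragent WHERE is_robot = TRUE)
GROUP BY 1, 2;

-- Hypothetical index name; supports the report's country + date-range lookup.
CREATE INDEX traffic_hit_daily_country_day ON traffic_hit_daily (country, day);

-- The report then reads a few hundred pre-aggregated rows instead of millions.
SELECT day AS group_by, unique_visits, non_unique_visits
FROM traffic_hit_daily
WHERE day >= DATE '2016-01-01'
  AND day <  DATE '2017-01-01'
  AND country = 'CZ'
ORDER BY day;

-- A materialized view is not updated automatically; recompute it when needed.
REFRESH MATERIALIZED VIEW traffic_hit_daily;
```

A plain summary table maintained by a scheduled job would be the "by hand" variant of the same idea.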

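You can do a partial aggregation, for example aggregating the data grouped by date(created) plus whatever other columns you need. People can then re-aggregate this reduced data set over a specific date range (as long as they are happy with whole-day boundaries), and can filter on the other columns, aggregate over them, or group by them. If they want counts, you have to be careful to sum the counts, not count the counts. If you want averages, you have to be careful to weight the averages by the counts, rather than taking an unweighted average of the averages.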
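The pitfalls with counts and averages can be illustrated on two made-up rows of pre-aggregated data. Everything in this sketch is hypothetical: the inline relation daily, its values, and the avg_duration column (the original schema has no such measure) exist only to show the arithmetic.

```sql
-- Hypothetical partial aggregates: one row per (day, country).
WITH daily(day, country, visits, avg_duration) AS (
  VALUES
    (DATE '2016-01-01', 'CZ', 10, 2.0),
    (DATE '2016-01-02', 'CZ', 30, 4.0)
)
SELECT
  SUM(visits)                              AS total_visits,    -- 40: sum of the counts (correct)
  COUNT(visits)                            AS count_of_counts, -- 2: only the number of days (wrong)
  SUM(avg_duration * visits) / SUM(visits) AS weighted_avg,    -- 3.5: average over all 40 visits
  AVG(avg_duration)                        AS unweighted_avg   -- 3.0: ignores that day 2 had more visits
FROM daily;
```

The weighted form gives the same result as averaging the 40 underlying rows directly, which is why the partial aggregate needs to keep the count next to each average.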

Of course, if you change your mind about what is or is not a robot, you will have to rebuild the partial aggregate table.

In any case, the bottleneck is not the NOT IN clause, but the sheer volume of raw data you are trying to process.

Parallel query in the upcoming PostgreSQL v10 release may help with this query.
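One way to check whether a parallel plan is chosen on a v10 server is sketched below. The worker count of 4 is an arbitrary illustrative value, and the planner may still pick a serial plan depending on its cost estimates and the server's worker limits.

```sql
-- Assumption: 4 is an arbitrary value for experimentation, not a recommendation.
SET max_parallel_workers_per_gather = 4;

-- A parallel plan shows a Gather node with "Workers Launched" in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  traffic_hit.created::DATE AS group_by,
  COUNT(*) FILTER (WHERE traffic_hit."unique") AS unique_visits,
  COUNT(*) AS non_unique_visits
FROM
  traffic_hit
WHERE
  traffic_hit.created >= '2016-01-01' AND
  traffic_hit.created < '2017-01-01' AND
  traffic_hit.country = 'CZ' AND
  user_agent_id NOT IN (SELECT id FROM utils_useragent WHERE is_robot = TRUE)
GROUP BY 1;
```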