How to avoid a subquery in a FILTER clause?

Str*_*667 3 postgresql performance subquery

Schema

  CREATE TABLE "applications" (
  "id"             SERIAL                   NOT NULL PRIMARY KEY,
  "country"        VARCHAR(2)               NOT NULL,
  "created"        TIMESTAMP WITH TIME ZONE NOT NULL,
  "is_preliminary" BOOLEAN                  NOT NULL,
  "first_name"     VARCHAR(128)             NOT NULL,
  "last_name"      VARCHAR(128)             NOT NULL,
  "birth_number"   VARCHAR(11)              NULL
);

CREATE TABLE "persons" (
  "id"       UUID                     NOT NULL PRIMARY KEY,
  "created"  TIMESTAMP WITH TIME ZONE NOT NULL,
  "modified" TIMESTAMP WITH TIME ZONE NOT NULL
);

ALTER TABLE "applications" ADD COLUMN "physical_person_id" UUID NULL;
CREATE INDEX "physical_person_id_idx" ON "applications" ("physical_person_id");

ALTER TABLE "applications" ADD CONSTRAINT "physical_person_id_fk" FOREIGN KEY ("physical_person_id") REFERENCES "persons" ("id") DEFERRABLE INITIALLY DEFERRED;
CREATE INDEX "country_created" ON "applications" (country, created);

Note: the value of persons.created should be the same as applications.created of that person's first application, regardless of the is_preliminary value.

Query

SELECT
  to_char(created, 'YYYY-MM-DD') AS "Date",
  COUNT(*) AS "Total",
  COALESCE(
    COUNT(*) FILTER(
      WHERE applications.is_preliminary = false
      AND NOT EXISTS(
        SELECT 1
        FROM applications A
        WHERE A.physical_person_id = applications.physical_person_id
          AND A.created < applications.created
        LIMIT 1
      )
    )
    , 0
  ) AS "Is first app"
FROM applications
WHERE
  created >= '2017-01-01'::TIMESTAMP AND created < '2017-07-01'::TIMESTAMP
  AND country = 'CZ'
GROUP BY 1
ORDER BY 1

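Objective: I want to see the total number of applications versus the number of first applications per day for a given country. A first application is one registered on a given day by a person who has no earlier application.

Problem: query performance. The number of rows is growing, and performance is no longer good.

Data sample: here, as xz-compressed pg_dump output.

The following query plan was taken from my laptop (in production there is no "external merge").

Query plan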

 GroupAggregate  (cost=54186.11..2391221.59 rows=186832 width=48) (actual time=2137.029..3224.937 rows=181 loops=1)
   Group Key: (to_char(applications.created, 'YYYY-MM-DD'::text))
   ->  Sort  (cost=54186.11..54653.19 rows=186832 width=57) (actual time=2128.554..2370.798 rows=186589 loops=1)
         Sort Key: (to_char(applications.created, 'YYYY-MM-DD'::text))
         Sort Method: external merge  Disk: 8176kB
         ->  Bitmap Heap Scan on applications  (cost=5262.54..30803.18 rows=186832 width=57) (actual time=93.993..411.096 rows=186589 loops=1)
               Recheck Cond: (((country)::text = 'CZ'::text) AND (created >= '2017-01-01 00:00:00'::timestamp without time zone) AND (created < '2017-07-01 00:00:00'::timestamp without time zone))
               Heap Blocks: exact=19640
               ->  Bitmap Index Scan on country_created  (cost=0.00..5215.83 rows=186832 width=0) (actual time=90.945..90.945 rows=186589 loops=1)
                     Index Cond: (((country)::text = 'CZ'::text) AND (created >= '2017-01-01 00:00:00'::timestamp without time zone) AND (created < '2017-07-01 00:00:00'::timestamp without time zone))
   SubPlan 1
     ->  Index Scan using physical_person_id_idx on applications a  (cost=0.43..72.77 rows=6 width=0) (actual time=0.006..0.006 rows=1 loops=127558)
           Index Cond: (physical_person_id = applications.physical_person_id)
           Filter: (created < applications.created)
           Rows Removed by Filter: 0
 Planning time: 0.235 ms
 Execution time: 3261.530 ms

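Question: How can I improve the query performance? I suppose the subquery in "Is first app" could be eliminated, but I don't know how.

PostgreSQL version: 9.6.3

Query plan after Evan Carroll's update: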

Subquery Scan on t  (cost=51624.73..2390836.50 rows=186782 width=52) (actual time=291.726..1129.435 rows=181 loops=1)
 ->  GroupAggregate  (cost=51624.73..2388034.77 rows=186782 width=20) (actual time=291.707..1128.057 rows=181 loops=1)
       Group Key: ((applications.created)::date)
       ->  Sort  (cost=51624.73..52091.69 rows=186782 width=29) (actual time=280.283..334.391 rows=186589 loops=1)
             Sort Key: ((applications.created)::date)
             Sort Method: external merge  Disk: 6720kB
             ->  Bitmap Heap Scan on applications  (cost=5261.90..30801.54 rows=186782 width=29) (actual time=42.944..181.325 rows=186589 loops=1)
                   Recheck Cond: (((country)::text = 'CZ'::text) AND (created >= '2017-01-01 00:00:00+01'::timestamp with time zone) AND (created <= '2017-07-01 00:00:00+02'::timestamp with time zone))
                   Heap Blocks: exact=19640
                   ->  Bitmap Index Scan on country_created  (cost=0.00..5215.20 rows=186782 width=0) (actual time=40.003..40.003 rows=186589 loops=1)
                         Index Cond: (((country)::text = 'CZ'::text) AND (created >= '2017-01-01 00:00:00+01'::timestamp with time zone) AND (created <= '2017-07-01 00:00:00+02'::timestamp with time zone))
       SubPlan 1
         ->  Index Scan using physical_person_id_idx on applications a  (cost=0.43..72.77 rows=6 width=0) (actual time=0.006..0.006 rows=1 loops=127558)
               Index Cond: (physical_person_id = applications.physical_person_id)
               Filter: (created < applications.created)
               Rows Removed by Filter: 0
Planning time: 0.232 ms
Execution time: 1145.761 ms

The initial query without the is_first_app column takes about 300 ms.

Query plan for the alternative solution from Erwin Brandstetter:

 GroupAggregate  (cost=51356.14..55562.83 rows=186964 width=20) (actual time=562.470..620.993 rows=181 loops=1)
   Group Key: ((a.created)::date)
   Buffers: shared hit=2137 read=4491, temp read=2491 written=2485
   ->  Sort  (cost=51356.14..51823.55 rows=186964 width=20) (actual time=562.216..592.226 rows=186589 loops=1)
         Sort Key: ((a.created)::date)
         Sort Method: external merge  Disk: 2640kB
         Buffers: shared hit=2137 read=4491, temp read=2491 written=2485
         ->  Hash Right Join  (cost=13394.71..31149.19 rows=186964 width=20) (actual time=119.488..464.407 rows=186589 loops=1)
               Hash Cond: ((p.id = a.physical_person_id) AND (p.created = a.created))
               Join Filter: (NOT a.is_preliminary)
               Buffers: shared hit=2137 read=4491, temp read=2159 written=2153
               ->  Seq Scan on persons p  (cost=0.00..9003.04 rows=364404 width=24) (actual time=3.800..73.486 rows=364404 loops=1)
                     Buffers: shared hit=868 read=4491
               ->  Hash  (cost=9311.25..9311.25 rows=186964 width=25) (actual time=115.213..115.213 rows=186589 loops=1)
                     Buckets: 65536  Batches: 4  Memory Usage: 2875kB
                     Buffers: shared hit=1269, temp written=681
                     ->  Index Only Scan using app_country_created_person_preliminary_idx on applications a  (cost=0.56..9311.25 rows=186964 width=25) (actual time=0.054..64.392 rows=186589 loops=1)
                           Index Cond: (... AND (created < '2017-07-01 00:00:00+02'::timestamp with time zone))
                           Heap Fetches: 0
                           Buffers: shared hit=1269
 Planning time: 0.401 ms
 Execution time: 628.100 ms

Erw*_*ter 6

A few minor improvements:

SELECT created::date AS the_date
     , COUNT(*) AS total
     , COUNT(*) FILTER( WHERE is_preliminary = false
                        AND   NOT EXISTS (
                           SELECT 1
                           FROM   applications
                           WHERE  physical_person_id = a.physical_person_id
                           AND    created < a.created
                        -- AND    created < a.created::date  -- alternative? see below
                        -- AND    is_preliminary = false     -- omission? see below
                        -- AND    country = 'CZ'             -- not sure. see below
                           LIMIT  1
                           )
                        ) AS is_first_app
FROM   applications a
WHERE  created >= '2017-01-01'::timestamptz
AND    created <  '2017-07-01'::timestamptz
AND    country = 'CZ'
GROUP  BY created::date
ORDER  BY created::date;
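
  • COALESCE(count(...), 0) is always redundant noise, because count() never returns NULL to begin with. Just remove it.

  • The way you have it, you group and order by the text representation of the timestamptz column created, which happens to work correctly. But it is more expensive than grouping and ordering by an actual date (internally a 4-byte integer value). Sorting by an actual date or timestamp is also generally more reliable, though it makes no difference in this particular query. The simplest way to achieve this is a plain cast to date: created::date. You can still format the output if you like: to_char(created::date, 'YYYY-MM-DD') AS date. Same result, but then you have to repeat the grouping expression, since we GROUP BY created::date.

  • Do not use BETWEEN, as has been advised. Your filter with >= and < is superior. BETWEEN translates to >= and <=, which causes ugly corner cases with fractional seconds in timestamp (or timestamptz) values. But since the data type of the underlying column is timestamptz, cast to timestamptz directly. Same result, just one cast operation fewer: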

    WHERE  created >= '2017-01-01'::timestamptz
    AND    created <  '2017-07-01'::timestamptz
    
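  • You are aware that a date derived from a timestamptz value (as well as a cast to timestamptz without an explicit time zone) always depends on the current time zone setting, right? To rule out this sneaky source of errors, you can explicitly pin the query to a chosen time zone.

  • There may be a logic error in the computation of is_first_app. This is only my guess, though: you check whether any row in applications for the same person is older than the current row. But while you only allow is_preliminary = false for the current row, you do not enforce the same predicate for the rows you compare against. Typically, you would want to compare against rows that are also is_preliminary = false. I added a commented line to the query above.

    Also, since you form groups per day, do you really want earlier entries from the same day to count as previous applications? Maybe so, but maybe you really want to check for rows earlier than the current day: created < a.created::date

    Finally, and least certain, you may want to repeat the predicate AND country = 'CZ' to limit the comparison to the same country. I don't have enough information to say more.

  • I also shortened the syntax by trimming the noise double quotes (all identifiers are legal anyway) and by using a strategic table alias (applications a) in the outer SELECT.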
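
As a sketch of the time zone point in the list above: the query can be pinned to one zone with AT TIME ZONE, so that the result no longer depends on the session's TimeZone setting. The zone name 'Europe/Prague' is an assumption for illustration only; the question does not say which zone the days are meant in.

```sql
-- Assumption: days are meant in Czech local time ('Europe/Prague').
-- timestamptz AT TIME ZONE zone  -> local timestamp in that zone
-- timestamp   AT TIME ZONE zone  -> timestamptz for that local time
SELECT (created AT TIME ZONE 'Europe/Prague')::date AS the_date
     , count(*) AS total
FROM   applications
WHERE  created >= timestamp '2017-01-01' AT TIME ZONE 'Europe/Prague'
AND    created <  timestamp '2017-07-01' AT TIME ZONE 'Europe/Prague'
AND    country = 'CZ'
GROUP  BY 1
ORDER  BY 1;
```

The is_first_app column is left out to keep the sketch short; the same day expression would take the place of created::date there.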

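Index

Since you are concerned with optimizing read performance ...

Your multicolumn index country_created seems well suited to the outer SELECT. But read on ...

You can easily improve the EXISTS subquery with another multicolumn index: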

CREATE INDEX app_person_created_idx ON applications (physical_person_id, created);

To allow index-only scans (only if your write pattern allows it!):

CREATE INDEX app_country_created_person_preliminary_idx
ON applications (country, created, physical_person_id, is_preliminary);

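The additional columns physical_person_id and is_preliminary only make sense if you actually get index-only scans out of them.

After adding the last index I got two index-only scans, which is much faster for big tables.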
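
Whether index-only scans actually happen on a given installation can be checked with EXPLAIN. A minimal sketch, assuming both indexes above have been created; the VACUUM step is my addition, included because index-only scans rely on the visibility map, which is why the write pattern matters:

```sql
-- Refresh the visibility map and planner statistics first.
-- (VACUUM cannot run inside a transaction block.)
VACUUM (ANALYZE) applications;

-- Same query as in the answer above, wrapped in EXPLAIN.
EXPLAIN (ANALYZE, BUFFERS)
SELECT created::date AS the_date
     , count(*) AS total
     , count(*) FILTER (WHERE is_preliminary = false
                        AND   NOT EXISTS (
                           SELECT 1
                           FROM   applications
                           WHERE  physical_person_id = a.physical_person_id
                           AND    created < a.created
                           LIMIT  1
                           )
                       ) AS is_first_app
FROM   applications a
WHERE  created >= '2017-01-01'::timestamptz
AND    created <  '2017-07-01'::timestamptz
AND    country = 'CZ'
GROUP  BY created::date
ORDER  BY created::date;
```

In the output, look for Index Only Scan nodes with a low Heap Fetches count. A high count suggests the visibility map is stale and the scan still has to visit the heap.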

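Alternative

Your last comment opens up a new option:

When the first application is created, a new person with the same created value is created as well.

(The earlier statement in the question was too ambiguous to rely on.)

If this is reliably enforced (and created is never updated in either table), then there is a simpler and faster query, which also happens to "avoid a subquery in the FILTER clause" by using a LEFT [OUTER] JOIN instead: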

SELECT a.created::date AS date
     , COUNT(*)        AS total
     , COUNT(p.id)     AS is_first_app  -- count only counts non-null values
FROM   applications a
LEFT   JOIN persons p ON a.is_preliminary = false
                     AND p.id = a.physical_person_id  -- PK on persons.id allows max. 1 match
                     AND p.created = a.created
WHERE  a.created >= '2017-01-01'::timestamptz
AND    a.created <  '2017-07-01'::timestamptz
AND    a.country = 'CZ'
GROUP  BY a.created::date
ORDER  BY a.created::date;

For perfect read performance with two index-only scans, you can use the index app_country_created_person_preliminary_idx from above, plus this one on persons:

CREATE INDEX pers_id_created ON persons (id, created);
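
The alternative query is only correct if persons.created really equals the created value of each person's first application. Before relying on it, that assumption can be checked against the data. A rough sketch of such a check (the column aliases are mine, not from the question):

```sql
-- Count persons whose created value differs from the created value
-- of their earliest application. 0 means the assumption holds
-- for the current data.
SELECT count(*) AS mismatched_persons
FROM   persons p
JOIN  (
   SELECT physical_person_id, min(created) AS first_created
   FROM   applications
   WHERE  physical_person_id IS NOT NULL
   GROUP  BY physical_person_id
   ) a ON a.physical_person_id = p.id
WHERE  p.created <> a.first_created;
```

A result of 0 only describes the data as it is now. It does not show that the rule is enforced for future writes.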