Postgres NOT NULL optimization

ste*_*dev 1 postgresql optimization query-performance

I am trying to optimize this SQL query:

select topics.id from "topics"
     left join "articles_topics" on "topics"."id" = "articles_topics"."topic_id"
     left join "articles" on "articles_topics"."article_id" = "articles"."id"
where not "topics"."type" = 'sport' and "articles"."image" is not null
group by "topics"."id"
having COUNT(articles.id) > 10

Here is the full cost of the query (I used EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)):

Finalize HashAggregate  (cost=12881.12..12974.90 rows=2501 width=8) (actual time=209.037..210.463 rows=1381 loops=1)
  Output: topics.id
  Group Key: topics.id
  Filter: (count(articles.id) > 10)
  Rows Removed by Filter: 5672
  Buffers: shared hit=8624
  ->  Gather  (cost=12018.39..12843.61 rows=7502 width=16) (actual time=198.146..205.348 rows=10376 loops=1)
"        Output: topics.id, (PARTIAL count(articles.id))"
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=8624
        ->  Partial HashAggregate  (cost=11018.39..11093.41 rows=7502 width=16) (actual time=192.791..194.319 rows=5188 loops=2)
"              Output: topics.id, PARTIAL count(articles.id)"
              Group Key: topics.id
              Buffers: shared hit=8624
              Worker 0: actual time=188.316..190.218 rows=5394 loops=1
                Buffers: shared hit=3745
              ->  Hash Join  (cost=7499.10..10515.66 rows=100546 width=16) (actual time=54.077..159.378 rows=63672 loops=2)
"                    Output: topics.id, articles.id"
                    Inner Unique: true
                    Hash Cond: (articles_topics.topic_id = topics.id)
                    Buffers: shared hit=8624
                    Worker 0: actual time=47.006..148.933 rows=65595 loops=1
                      Buffers: shared hit=3745
                    ->  Parallel Hash Join  (cost=6948.79..9699.13 rows=101364 width=16) (actual time=48.622..113.016 rows=87035 loops=2)
"                          Output: articles_topics.topic_id, articles.id"
                          Inner Unique: true
                          Hash Cond: (articles_topics.article_id = articles.id)
                          Buffers: shared hit=7900
                          Worker 0: actual time=39.510..116.939 rows=90075 loops=1
                            Buffers: shared hit=3383
                          ->  Parallel Seq Scan on public.articles_topics  (cost=0.00..2464.56 rows=108856 width=16) (actual time=0.010..17.317 rows=92554 loops=2)
"                                Output: articles_topics.article_id, articles_topics.topic_id"
                                Buffers: shared hit=1376
                                Worker 0: actual time=0.010..21.592 rows=96072 loops=1
                                  Buffers: shared hit=732
                          ->  Parallel Hash  (cost=6720.30..6720.30 rows=18279 width=8) (actual time=46.963..46.964 rows=21942 loops=2)
                                Output: articles.id
                                Buckets: 65536  Batches: 1  Memory Usage: 2240kB
                                Buffers: shared hit=6524
                                Worker 0: actual time=39.462..39.462 rows=17804 loops=1
                                  Buffers: shared hit=2651
                                ->  Parallel Seq Scan on public.articles  (cost=0.00..6720.30 rows=18279 width=8) (actual time=0.010..30.455 rows=21942 loops=2)
                                      Output: articles.id
                                      Filter: (articles.image IS NOT NULL)
                                      Rows Removed by Filter: 1636
                                      Buffers: shared hit=6524
                                      Worker 0: actual time=0.014..26.579 rows=17804 loops=1
                                        Buffers: shared hit=2651
                    ->  Hash  (cost=456.54..456.54 rows=7502 width=8) (actual time=5.394..5.394 rows=7502 loops=2)
                          Output: topics.id
                          Buckets: 8192  Batches: 1  Memory Usage: 358kB
                          Buffers: shared hit=724
                          Worker 0: actual time=7.437..7.437 rows=7502 loops=1
                            Buffers: shared hit=362
                          ->  Seq Scan on public.topics  (cost=0.00..456.54 rows=7502 width=8) (actual time=0.022..2.176 rows=7502 loops=2)
                                Output: topics.id
                                Filter: ((topics.type)::text <> 'sport'::text)
                                Rows Removed by Filter: 61
                                Buffers: shared hit=724
                                Worker 0: actual time=0.027..2.189 rows=7502 loops=1
                                  Buffers: shared hit=362
Planning Time: 1.580 ms
Execution Time: 211.823 ms

I have tried adding indexes and moving the and "articles"."image" is not null condition into the join with articles..., and I have also tried this: https://stackoverflow.com/questions/31966218/postgresql-create-an-index-to-quickly-distinguish-null-from-non-null-values, but there was no improvement. Can this query be optimized somehow?
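
For reference, the partial-index approach from that linked answer would look roughly like this here (the index name is only illustrative, and as noted above it did not help in this case):

-- Partial index: only articles with a non-null image are indexed,
-- so the "image IS NOT NULL" predicate can be answered from the index alone.
CREATE INDEX articles_image_not_null_index
    ON public.articles (id)
    WHERE image IS NOT NULL;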

Create scripts:

CREATE TABLE public.articles (
    id bigserial NOT NULL,
    long_id varchar(255) NOT NULL,
    title varchar(1023) NULL,
    summary text NULL,
    is_top bool NULL DEFAULT false,
    date_published timestamptz NULL,
    image varchar(1023) NULL,
    original_url varchar(2047) NULL,
    created_at timestamptz NULL,
    updated_at timestamptz NULL,
    CONSTRAINT articles_long_id_unique UNIQUE (long_id),
    CONSTRAINT articles_pkey PRIMARY KEY (id)
);
CREATE INDEX articles_date_published_id_index ON public.articles USING btree (date_published, id);
CREATE INDEX articles_date_published_index ON public.articles USING btree (date_published);

CREATE TABLE public.topics (
    id bigserial NOT NULL,
    long_id varchar(255) NOT NULL,
    "name" varchar(255) NULL,
    icon_name varchar(255) NULL,
    "type" varchar(255) NULL,
    created_at timestamptz NULL,
    updated_at timestamptz NULL,
    logo varchar(255) NULL,
    image varchar(255) NULL,
    short_name varchar(255) NULL,
    source_id varchar(255) NULL,
    source_type_id int4 NULL,
    full_name varchar(255) NULL,
    nick_name varchar(255) NULL,
    first_name varchar(255) NULL,
    surname varchar(255) NULL,
    parent_topic_id int8 NULL,
    CONSTRAINT topics_long_id_unique UNIQUE (long_id),
    CONSTRAINT topics_pkey PRIMARY KEY (id),
    CONSTRAINT topics_type_source_id_unique UNIQUE (type, source_id),
    CONSTRAINT topics_parent_topic_id_foreign FOREIGN KEY (parent_topic_id) REFERENCES topics(id) ON DELETE SET NULL
);
CREATE INDEX topics_type_index ON public.topics USING btree (type);

CREATE TABLE public.articles_topics (
    article_id int8 NOT NULL,
    topic_id int8 NOT NULL,
    CONSTRAINT articles_topics_pkey PRIMARY KEY (article_id, topic_id),
    CONSTRAINT articles_topics_article_id_foreign FOREIGN KEY (article_id) REFERENCES articles(id) ON UPDATE CASCADE ON DELETE CASCADE,
    CONSTRAINT articles_topics_topic_id_foreign FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
);
CREATE INDEX articles_topics_topic_id_index ON public.articles_topics USING btree (topic_id);

jja*_*nes 5

"Actually, speed is not the problem; the problem is the cost."

The cost reported in EXPLAIN is just a rough estimate of speed, so I don't understand the distinction you are making.

If the problem is speed under high concurrency, I would turn off parallelization (max_parallel_workers_per_gather=0).
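
A quick way to test that suggestion, assuming you only want to compare plans within a single session:

-- Disable parallel plans for the current session only, then re-run the EXPLAIN.
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select topics.id from "topics"
     left join "articles_topics" on "topics"."id" = "articles_topics"."topic_id"
     left join "articles" on "articles_topics"."article_id" = "articles"."id"
where not "topics"."type" = 'sport' and "articles"."image" is not null
group by "topics"."id"
having COUNT(articles.id) > 10;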

But really, this doesn't look like the kind of query whose results change frequently, or one that needs absolutely up-to-date answers. So look into a materialized view (matview) or some other form of caching.
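
A minimal sketch of the matview idea, built directly from the original query (the view name is made up; how often to refresh depends on how stale the results are allowed to be):

-- The materialized view caches the aggregated result; reads become a plain scan of it.
CREATE MATERIALIZED VIEW topics_with_many_articles AS
select topics.id from "topics"
     left join "articles_topics" on "topics"."id" = "articles_topics"."topic_id"
     left join "articles" on "articles_topics"."article_id" = "articles"."id"
where not "topics"."type" = 'sport' and "articles"."image" is not null
group by "topics"."id"
having COUNT(articles.id) > 10;

-- A unique index is required for REFRESH ... CONCURRENTLY, which avoids blocking readers.
CREATE UNIQUE INDEX ON topics_with_many_articles (id);

-- Refresh on a schedule (cron, pg_cron, or from the application).
REFRESH MATERIALIZED VIEW CONCURRENTLY topics_with_many_articles;

Queries then read from topics_with_many_articles instead of re-running the aggregation every time.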