ste*_*dev 1 postgresql optimization query-performance
我正在尝试优化这个 SQL 查询:
select topics.id from "topics"
left join "articles_topics" on "topics"."id" = "articles_topics"."topic_id"
left join "articles" on "articles_topics"."article_id" = "articles"."id"
where not "topics"."type" = 'sport' and "articles"."image" is not null
group by "topics"."id"
having COUNT(articles.id) > 10
Run Code Online (Sandbox Code Playgroud)
这是完整的查询成本(我使用过EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
)
Finalize HashAggregate (cost=12881.12..12974.90 rows=2501 width=8) (actual time=209.037..210.463 rows=1381 loops=1)
Output: topics.id
Group Key: topics.id
Filter: (count(articles.id) > 10)
Rows Removed by Filter: 5672
Buffers: shared hit=8624
-> Gather (cost=12018.39..12843.61 rows=7502 width=16) (actual time=198.146..205.348 rows=10376 loops=1)
" Output: topics.id, (PARTIAL count(articles.id))"
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=8624
-> Partial HashAggregate (cost=11018.39..11093.41 rows=7502 width=16) (actual time=192.791..194.319 rows=5188 loops=2)
" Output: topics.id, PARTIAL count(articles.id)"
Group Key: topics.id
Buffers: shared hit=8624
Worker 0: actual time=188.316..190.218 rows=5394 loops=1
Buffers: shared hit=3745
-> Hash Join (cost=7499.10..10515.66 rows=100546 width=16) (actual time=54.077..159.378 rows=63672 loops=2)
" Output: topics.id, articles.id"
Inner Unique: true
Hash Cond: (articles_topics.topic_id = topics.id)
Buffers: shared hit=8624
Worker 0: actual time=47.006..148.933 rows=65595 loops=1
Buffers: shared hit=3745
-> Parallel Hash Join (cost=6948.79..9699.13 rows=101364 width=16) (actual time=48.622..113.016 rows=87035 loops=2)
" Output: articles_topics.topic_id, articles.id"
Inner Unique: true
Hash Cond: (articles_topics.article_id = articles.id)
Buffers: shared hit=7900
Worker 0: actual time=39.510..116.939 rows=90075 loops=1
Buffers: shared hit=3383
-> Parallel Seq Scan on public.articles_topics (cost=0.00..2464.56 rows=108856 width=16) (actual time=0.010..17.317 rows=92554 loops=2)
" Output: articles_topics.article_id, articles_topics.topic_id"
Buffers: shared hit=1376
Worker 0: actual time=0.010..21.592 rows=96072 loops=1
Buffers: shared hit=732
-> Parallel Hash (cost=6720.30..6720.30 rows=18279 width=8) (actual time=46.963..46.964 rows=21942 loops=2)
Output: articles.id
Buckets: 65536 Batches: 1 Memory Usage: 2240kB
Buffers: shared hit=6524
Worker 0: actual time=39.462..39.462 rows=17804 loops=1
Buffers: shared hit=2651
-> Parallel Seq Scan on public.articles (cost=0.00..6720.30 rows=18279 width=8) (actual time=0.010..30.455 rows=21942 loops=2)
Output: articles.id
Filter: (articles.image IS NOT NULL)
Rows Removed by Filter: 1636
Buffers: shared hit=6524
Worker 0: actual time=0.014..26.579 rows=17804 loops=1
Buffers: shared hit=2651
-> Hash (cost=456.54..456.54 rows=7502 width=8) (actual time=5.394..5.394 rows=7502 loops=2)
Output: topics.id
Buckets: 8192 Batches: 1 Memory Usage: 358kB
Buffers: shared hit=724
Worker 0: actual time=7.437..7.437 rows=7502 loops=1
Buffers: shared hit=362
-> Seq Scan on public.topics (cost=0.00..456.54 rows=7502 width=8) (actual time=0.022..2.176 rows=7502 loops=2)
Output: topics.id
Filter: ((topics.type)::text <> 'sport'::text)
Rows Removed by Filter: 61
Buffers: shared hit=724
Worker 0: actual time=0.027..2.189 rows=7502 loops=1
Buffers: shared hit=362
Planning Time: 1.580 ms
Execution Time: 211.823 ms
Run Code Online (Sandbox Code Playgroud)
我尝试使用索引,移至and "articles"."image" is not null
连接articles
部分...,我也尝试过:https ://stackoverflow.com/questions/31966218/postgresql-create-an-index-to-quickly-distinguish-null-from-非空值
但没有任何改进。我们可以以某种方式优化这个查询吗?
创建脚本:
CREATE TABLE public.articles (
id bigserial NOT NULL,
long_id varchar(255) NOT NULL,
title varchar(1023) NULL,
summary text NULL,
is_top bool NULL DEFAULT false,
date_published timestamptz NULL,
image varchar(1023) NULL,
original_url varchar(2047) NULL,
created_at timestamptz NULL,
updated_at timestamptz NULL,
CONSTRAINT articles_long_id_unique UNIQUE (long_id),
CONSTRAINT articles_pkey PRIMARY KEY (id)
);
CREATE INDEX articles_date_published_id_index ON public.articles USING btree (date_published, id);
CREATE INDEX articles_date_published_index ON public.articles USING btree (date_published);
CREATE TABLE public.topics (
id bigserial NOT NULL,
long_id varchar(255) NOT NULL,
"name" varchar(255) NULL,
icon_name varchar(255) NULL,
"type" varchar(255) NULL,
created_at timestamptz NULL,
updated_at timestamptz NULL,
logo varchar(255) NULL,
image varchar(255) NULL,
short_name varchar(255) NULL,
source_id varchar(255) NULL,
source_type_id int4 NULL,
full_name varchar(255) NULL,
nick_name varchar(255) NULL,
first_name varchar(255) NULL,
surname varchar(255) NULL,
parent_topic_id int8 NULL,
CONSTRAINT topics_long_id_unique UNIQUE (long_id),
CONSTRAINT topics_pkey PRIMARY KEY (id),
CONSTRAINT topics_type_source_id_unique UNIQUE (type, source_id),
CONSTRAINT topics_parent_topic_id_foreign FOREIGN KEY (parent_topic_id) REFERENCES topics(id) ON DELETE SET NULL
);
CREATE INDEX topics_type_index ON public.topics USING btree (type);
CREATE TABLE public.articles_topics (
article_id int8 NOT NULL,
topic_id int8 NOT NULL,
CONSTRAINT articles_topics_pkey PRIMARY KEY (article_id, topic_id),
CONSTRAINT articles_topics_article_id_foreign FOREIGN KEY (article_id) REFERENCES articles(id) ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT articles_topics_topic_id_foreign FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
);
CREATE INDEX articles_topics_topic_id_index ON public.articles_topics USING btree (topic_id);
Run Code Online (Sandbox Code Playgroud)
其实速度不是问题,问题是成本
解释中报告的成本只是对速度的粗略估计,所以我不明白这种区别。
如果问题是高并发下的速度,我会关闭并行化(max_parallel_workers_per_gather=0)。
但实际上,这看起来并不像那种结果经常变化的查询,或者需要绝对准确性的查询。因此,请研究 matview 或其他形式的缓存。