Leo*_*sky 3 postgresql explain
I'm trying to build reports on my data, but it's really slow on a large table.
The table structure:
CREATE TABLE posts
(
id serial NOT NULL,
project_id integer,
moderation character varying(255),
keyword_id integer,
author_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
server_id character varying(255),
social_creation_time integer,
social_id character varying(255),
network character varying(255),
mood character varying(255) DEFAULT NULL::character varying,
url text,
source_id integer,
location character varying(255),
subject_id integer,
conversation_id integer,
CONSTRAINT posts_pkey PRIMARY KEY (id)
);
CREATE INDEX index_posts_on_author_id ON posts (author_id);
CREATE INDEX index_posts_on_keyword_id ON posts (keyword_id);
CREATE INDEX index_posts_on_project_id_and_network_and_social_id
ON posts (project_id, network, social_id);
CREATE INDEX index_posts_on_project_id_and_social_creation_time
ON posts (project_id, social_creation_time DESC);
CREATE INDEX index_posts_on_server_id ON posts (server_id);
CREATE INDEX index_posts_on_social_id ON posts (social_id);
The query:
SELECT date_trunc('hour', timestamp 'epoch'
+ (posts.social_creation_time * INTERVAL '1 second')) creating,
network,
count(*) posts
FROM posts
WHERE posts.project_id = 7
AND (posts.moderation NOT IN ('junk','spam'))
AND (posts.social_creation_time BETWEEN 1391716800 AND 1392839999)
GROUP BY network, creating
ORDER BY creating
The count is 3940689.
The explain plan:
GroupAggregate (cost=631282.11..671932.05 rows=338750 width=12) (actual time=22576.318..23826.124 rows=1776 loops=1)
-> Sort (cost=631282.11..639750.85 rows=3387494 width=12) (actual time=22576.188..23438.485 rows=3536790 loops=1)
Sort Key: (date_trunc('hour'::text, ('1970-01-01 00:00:00'::timestamp without time zone + ((social_creation_time)::double precision * '00:00:01'::interval)))), network
Sort Method: external merge Disk: 92032kB
-> Seq Scan on posts (cost=0.00..205984.62 rows=3387494 width=12) (actual time=29.542..1954.865 rows=3536790 loops=1)
Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999) AND (project_id = 7))
Rows Removed by Filter: 404218
Total runtime: 23842.532 ms
(8 rows)
Time: 23860.876 ms
That's a seq scan, but forcing it to use an index doesn't help either:
GroupAggregate (cost=815927.00..856583.47 rows=338804 width=12) (actual time=24634.378..25873.754 rows=1778 loops=1)
-> Sort (cost=815927.00..824397.09 rows=3388039 width=12) (actual time=24634.243..25498.578 rows=3537295 loops=1)
Sort Key: (date_trunc('hour'::text, ('1970-01-01 00:00:00'::timestamp without time zone + ((social_creation_time)::double precision * '00:00:01'::interval)))), network
Sort Method: external merge Disk: 92048kB
-> Bitmap Heap Scan on posts (cost=191020.29..390555.96 rows=3388039 width=12) (actual time=4074.171..5685.734 rows=3537295 loops=1)
Recheck Cond: (project_id = 7)
Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999))
Rows Removed by Filter: 67925
-> Bitmap Index Scan on index_posts_on_project_id_and_network_and_social_id (cost=0.00..190173.29 rows=3617164 width=0) (actual time=4054.817..4054.817 rows=3605225 loops=1)
Index Cond: (project_id = 7)
Total runtime: 25891.215 ms
A sample row from the table:
id | project_id | moderation | keyword_id | author_id | created_at | updated_at | server_id | social_creation_time | social_id | network | mood | url | source_id | location | subject_id | conversation_id
---
204202 | 2 | pending | | 125845 | 2014-01-22 15:14:14.786454 | 2014-01-22 15:14:14.786454 | 20620977 | 1390318030 | -64193113_14905 | vkontakte | | https://vk.com/wall-64193113_14905 | 64 | ??????, ?????????? | |
**Update**
What really helped me was raising the work_mem setting.
My new plan:
HashAggregate (cost=247145.17..254270.53 rows=356268 width=12) (actual time=2564.201..2564.731 rows=1853 loops=1)
-> Seq Scan on posts (cost=0.00..220425.11 rows=3562675 width=12) (actual time=32.916..1914.618 rows=3729876 loops=1)
Filter: (((moderation)::text <> ALL ('{junk,spam}'::text[])) AND (social_creation_time >= 1391716800) AND (social_creation_time <= 1392839999) AND (project_id = 7))
Rows Removed by Filter: 501865
Total runtime: 2566.071 ms
**Update #2**: I'm thinking of creating an integer column that stores the date as YYYYMMDD, e.g. 20140220. What do you think, Stack Exchange: would it improve performance?
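For reference, this is how the epoch arithmetic in the query maps a Unix timestamp to an hour bucket, and what the equivalent YYYYMMDD integer would look like (a standalone psql sketch, assuming UTC):

```sql
-- Hour bucket, as computed in the report query:
SELECT date_trunc('hour', timestamp 'epoch' + 1391716800 * interval '1 second');
-- → 2014-02-06 20:00:00

-- The same moment as a YYYYMMDD integer:
SELECT to_char(to_timestamp(1391716800) AT TIME ZONE 'UTC', 'YYYYMMDD')::int;
-- → 20140206
```

Note that a YYYYMMDD integer only carries day granularity, while the report groups by hour.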
PS: Sorry for my bad English.
Erw*_*ter 11
In addition to the good advice from @Craig and @dezso:

You say the count is 3940689. However, your query plan says:

Seq Scan on posts (cost=0.00..205984.62 rows=**3387494** width=12)
and your count is based on the selection:

Rows Removed by Filter: 404218
4344907 (3940689 + 404218) >> 3387494. Your statistics are not up to date. There is probably a problem with your autovacuum settings, which include running ANALYZE automatically; that is very bad for overall database performance. Before you retry anything, run:

ANALYZE posts;
If you can afford to lock the table for a while, run

VACUUM FULL ANALYZE posts;
to clean house.
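To check when the table was last vacuumed and analyzed (manually or by autovacuum), the standard statistics view pg_stat_user_tables can help; a quick sketch:

```sql
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'posts';
```

If last_autoanalyze is NULL or very old while n_dead_tup is large, autovacuum is not keeping up.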
These numbers indicate that your query uses roughly 90% of all rows. A sequential scan is therefore faster than any possible index scan, with the exception of a covering index (index-only scan), which requires Postgres 9.2+. Be sure to read the relevant page in the Postgres Wiki.
Since you only use two small columns out of a long list of columns, such an index would be much smaller and faster. Depending on your overall requirements, a tailored index along the following lines (partial, functional, multicolumn, covering) could squeeze out maximum performance, at the cost of write operations:
CREATE INDEX test_idx ON posts (
    date_trunc('hour', timestamp 'epoch' + social_creation_time * interval '1 sec')
  , network)
WHERE moderation NOT IN ('junk', 'spam')
AND   project_id = 7                                           -- ??
AND   social_creation_time BETWEEN 1391716800 AND 1392839999;  -- ??
The actual WHERE conditions depend on your actual needs and have to be added, in more or less the same form, to any query that is supposed to use this index. The predicate prunes rows that are never used from the index. Only use conditions that eliminate more than a few rows, and tailor them to a superset of the rows your queries need.
Generally, covering indexes pay off for fairly static tables. Read the wiki. A quick test:
SELECT relallvisible, relpages
FROM   pg_class
WHERE  oid = 'posts'::regclass;
If relallvisible is not much smaller than relpages, chances are good. Make sure autovacuum is running properly before you try this.
I would also test without the functional expression, to see which index is used / faster:
CREATE INDEX test_idx ON posts (social_creation_time, network)
WHERE moderation NOT IN ('junk','spam')
AND project_id = 7 -- ??
AND social_creation_time BETWEEN 1391716800 AND 1392839999; -- ??
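To verify which index the planner actually picks, run the report query under EXPLAIN once per candidate index (the BUFFERS option exists since Postgres 9.0):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('hour', timestamp 'epoch'
       + posts.social_creation_time * interval '1 second') AS creating
     , network
     , count(*) AS posts
FROM   posts
WHERE  posts.project_id = 7
AND    posts.moderation NOT IN ('junk', 'spam')
AND    posts.social_creation_time BETWEEN 1391716800 AND 1392839999
GROUP  BY network, creating
ORDER  BY creating;
```

Look for "Index Only Scan" in the output and compare "Buffers: shared hit/read" between variants.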
Finally, integer and text columns alternate in your table definition, which bloats the table considerably due to data alignment and padding. More in this related answer:
配置 PostgreSQL 以提高读取性能
I would recreate your table along these lines:
CREATE TABLE post (
post_id serial PRIMARY KEY,
project_id integer,
created_at timestamp,
updated_at timestamp,
keyword_id integer,
author_id integer,
source_id integer,
subject_id integer,
conversation_id integer,
social_creation_time integer,
server_id text, -- could be integer?
social_id text,
moderation text,
network text,
url text,
location text,
mood text
);
It will be a bit smaller, which helps overall performance.
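To quantify the effect, compare the table size before and after such a rewrite; pg_total_relation_size() includes indexes and TOAST:

```sql
SELECT pg_size_pretty(pg_total_relation_size('posts'));
```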
Why text rather than varchar(255)?
Sort Method: external merge Disk: 92048kB
Throw more work_mem at the problem. A lot more. Try:
SET LOCAL work_mem = '300MB';
Note that this can eat a lot of system RAM if run in many concurrent connections, so SET it only in individual sessions.
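Since SET LOCAL only lasts until the end of the current transaction, wrap it together with the query, along these lines:

```sql
BEGIN;
SET LOCAL work_mem = '300MB';  -- reverts automatically at COMMIT / ROLLBACK
-- ... run the report query here ...
COMMIT;
```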
Your row count estimates for the aggregate are a bit off (http://explain.depesz.com/s/RXbq), but not bad. The killer seems to be the big sort.
Views: 19515