gue*_*tli 6 sql postgresql performance disk-io database-performance
如果执行以下数据库(postgres)查询,则第二次调用要快得多.
我想第一个查询很慢,因为操作系统(linux)需要从磁盘获取数据.第二个查询受益于文件系统级别和postgres中的缓存.
有没有办法优化数据库,以便在第一次调用时快速获得结果?
第一次通话(慢)
foo3_bar_p@BAR-FOO3-Test:~$ psql
foo3_bar_p=# explain analyze SELECT "foo3_beleg"."id", ... FROM "foo3_beleg" WHERE
foo3_bar_p-# (("foo3_beleg"."id" IN (SELECT beleg_id FROM foo3_text where
foo3_bar_p(# content @@ 'footown'::tsquery)) AND "foo3_beleg"."belegart_id" IN
foo3_bar_p(# ('...', ...));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=75314.58..121963.20 rows=152 width=135) (actual time=27253.451..88462.165 rows=11 loops=1)
-> HashAggregate (cost=75314.58..75366.87 rows=5229 width=4) (actual time=16087.345..16113.988 rows=17671 loops=1)
-> Bitmap Heap Scan on foo3_text (cost=273.72..75254.67 rows=23964 width=4) (actual time=327.653..16026.787 rows=27405 loops=1)
Recheck Cond: (content @@ '''footown'''::tsquery)
-> Bitmap Index Scan on foo3_text_content_idx (cost=0.00..267.73 rows=23964 width=0) (actual time=281.909..281.909 rows=27405 loops=1)
Index Cond: (content @@ '''footown'''::tsquery)
-> Index Scan using foo3_beleg_pkey on foo3_beleg (cost=0.00..8.90 rows=1 width=135) (actual time=4.092..4.092 rows=0 loops=17671)
Index Cond: (id = foo3_text.beleg_id)
Filter: ((belegart_id)::text = ANY ('{...
Rows Removed by Filter: 1
Total runtime: 88462.809 ms
(11 rows)
Run Code Online (Sandbox Code Playgroud)
二次通话(快速)
Nested Loop (cost=75314.58..121963.20 rows=152 width=135) (actual time=127.569..348.705 rows=11 loops=1)
-> HashAggregate (cost=75314.58..75366.87 rows=5229 width=4) (actual time=114.390..133.131 rows=17671 loops=1)
-> Bitmap Heap Scan on foo3_text (cost=273.72..75254.67 rows=23964 width=4) (actual time=11.961..97.943 rows=27405 loops=1)
Recheck Cond: (content @@ '''footown'''::tsquery)
-> Bitmap Index Scan on foo3_text_content_idx (cost=0.00..267.73 rows=23964 width=0) (actual time=9.226..9.226 rows=27405 loops=1)
Index Cond: (content @@ '''footown'''::tsquery)
-> Index Scan using foo3_beleg_pkey on foo3_beleg (cost=0.00..8.90 rows=1 width=135) (actual time=0.012..0.012 rows=0 loops=17671)
Index Cond: (id = foo3_text.beleg_id)
Filter: ((belegart_id)::text = ANY ('...
Rows Removed by Filter: 1
Total runtime: 348.833 ms
(11 rows)
Run Code Online (Sandbox Code Playgroud)
foo3_text表的表格布局(28M行)
foo3_egs_p=# \d foo3_text
Table "public.foo3_text"
Column | Type | Modifiers
----------+-----------------------+------------------------------------------------------------
id | integer | not null default nextval('foo3_text_id_seq'::regclass)
beleg_id | integer | not null
index_id | character varying(32) | not null
value | text | not null
content | tsvector |
Indexes:
"foo3_text_pkey" PRIMARY KEY, btree (id)
"foo3_text_index_id_2685e3637668d5e7_uniq" UNIQUE CONSTRAINT, btree (index_id, beleg_id)
"foo3_text_beleg_id" btree (beleg_id)
"foo3_text_content_idx" gin (content)
"foo3_text_index_id" btree (index_id)
"foo3_text_index_id_like" btree (index_id varchar_pattern_ops)
Foreign-key constraints:
"beleg_id_refs_id_6e6d40770e71292" FOREIGN KEY (beleg_id) REFERENCES foo3_beleg(id) DEFERRABLE INITIALLY DEFERRED
"index_id_refs_name_341600137465c2f9" FOREIGN KEY (index_id) REFERENCES foo3_index(name) DEFERRABLE INITIALLY DEFERRED
Run Code Online (Sandbox Code Playgroud)
可以进行硬件更改(SSD而不是传统磁盘)或RAM磁盘.但也许当前的硬件也可以做得更快.
版本:x86_64-unknown-linux-gnu上的PostgreSQL 9.1.2
如果您需要更多详细信息,请发表评论.
有时将“WHERE x IN”移至 JOIN 可以显着提高性能。尝试这个:
SELECT
foo3_beleg.id, ...
FROM
foo3_beleg b INNER JOIN
foo3_text t ON (t.beleg_id = b.id AND t.content @@ 'footown'::tsquery)
WHERE
foo3_beleg.belegart_id IN ('...', ...);
Run Code Online (Sandbox Code Playgroud)
这是一个可重复的实验来支持我的主张。
我碰巧手头有一个大型 Postgres 数据库(3000 万行)(http://juliusdavies.ca/2013/j.emse/bertillonage/),所以我将其加载到 postgres 9.4beta3 中。
结果令人印象深刻。子选择方法大约慢 20 倍:
time psql myDb < using-in.sql
real 0m17.212s
time psql myDb < using-join.sql
real 0m0.807s
Run Code Online (Sandbox Code Playgroud)
对于那些对复制感兴趣的人,这里是我用来测试我的理论的原始 SQL 查询。
此查询使用“SELECT IN”子查询,速度慢了 20 倍(在我的笔记本电脑上第一次执行需要 17 秒):
-- using-in.sql
SELECT
COUNT(DISTINCT sigsha1re) AS a_intersect_b, infilesha1
FROM
files INNER JOIN sigs ON (files.filesha1 = sigs.filesha1)
WHERE
sigs.sigsha1re IN (
SELECT sigsha1re FROM sigs WHERE sigs.sigsha1re like '0347%'
)
GROUP BY
infilesha1
Run Code Online (Sandbox Code Playgroud)
此查询将条件从子查询移入连接条件,并且速度快了 20 倍(在我的笔记本电脑上,第一次执行为 0.8 秒)。
-- using-join.sql
SELECT
COUNT(DISTINCT sigsha1re) AS a_intersect_b, infilesha1
FROM
files INNER JOIN sigs ON (
files.filesha1 = sigs.filesha1 AND sigs.sigsha1re like '0347%'
)
GROUP BY
infilesha1
Run Code Online (Sandbox Code Playgroud)
ps 如果您好奇该数据库的用途,您可以使用它来计算任意 jar 文件与 2011 年左右 maven 存储库中的所有 jar 文件的相似程度。
./query.sh lib/commons-codec-1.5.jar | psql myDb
similarity | a = 39 = commons-codec-1.5.jar (bin2bin)
------------+--------------------------------------------------------------------------------------
1.000 | commons-codec-1.5.jar
0.447 | commons-codec-1.4.jar
0.174 | org.apache.sling.auth.form-1.0.2.jar
0.170 | org.apache.sling.auth.form-1.0.0.jar
0.142 | jbehave-core-3.0-beta-3.jar
0.142 | jbehave-core-3.0-beta-4.jar
0.141 | jbehave-core-3.0-beta-5.jar
0.141 | jbehave-core-3.0-beta-6.jar
0.140 | commons-codec-1.2.jar
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
2006 次 |
最近记录: |