The*_*Log 9 sql postgresql ruby-on-rails
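I have two tables that are linked to each other, as follows:
Table answered_questions has the following columns and indexes:
id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table, questions)
indexes: (taken_test_id, question_id)
Table taken_tests:
id: primary key
user_id: (foreign key, links to the users table)
indexes: user_id column
First query (with EXPLAIN ANALYZE output):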
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
  -> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
       Index Cond: (user_id = 1)
  -> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
       Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.276 ms
Execution time: 2.365 ms
(7 rows)
The other query (this one is generated automatically by Rails when using the joins method in ActiveRecord):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Here is the output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
  -> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
       Index Cond: (user_id = 1)
  -> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=2.071..3.195 rows=2 loops=371)
       Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.302 ms
Execution time: 1258.035 ms
(7 rows)
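The only difference is the order of the columns in the INNER JOIN condition. In the first query it is ON "answered_questions"."taken_test_id" = "taken_tests"."id"; in the second it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". Yet the execution times differ enormously.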
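Do you have any idea why this happens? I have read some articles saying that the order of the columns in a JOIN condition should not affect execution time (for example: Best practices for the order of joined columns in a sql join?).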
I am using Postgres 9.6. There are more than 400,000 rows in the answered_questions table and more than 3 million rows in the taken_tests table.
When I ran EXPLAIN (analyze true, verbose true, buffers true), I got a much better result for the second query (very similar to the first query):
EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE)
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
  Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
  Buffers: shared hit=1986
  -> Index Scan using index_taken_tests_on_user_id on public.taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
       Output: taken_tests.id
       Index Cond: (taken_tests.user_id = 1)
       Buffers: shared hit=269
  -> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
       Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
       Index Cond: (answered_questions.taken_test_id = taken_tests.id)
       Buffers: shared hit=1717
Planning time: 0.238 ms
Execution time: 2.335 ms
As you can see from the results of the initial EXPLAIN ANALYZE statements, the two queries produce equivalent query plans and are executed in exactly the same way.
The difference comes from the execution time of the very same node:
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
and
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
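As commenters have already pointed out (see the documentation links in the question comments), the query plan for an inner join should be the same regardless of the order in which the tables are written; the join order is chosen by the query planner. This means you should really be looking at other aspects of query execution performance. One of them is the memory used for caching (shared buffers, configured via the shared_buffers setting). It looks like the query time depends heavily on whether the data is already loaded in memory. As you noticed, the execution time goes up after waiting for a while. This clearly points to a cache expiry problem rather than a planning problem. Increasing the size of the shared buffers may help, but the first execution of the query will always take longer; that is simply your disk access speed.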
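One way to check the caching explanation is to compare buffer counters between a slow run and a fast run. The following is a minimal sketch, assuming the table names from the question. In EXPLAIN output with BUFFERS enabled, "shared hit" counts pages found in shared buffers, while "read" counts pages that had to be fetched from outside them (the operating system cache or disk). The fast run shown in the question reports only "shared hit" values, so a cold run would be expected to show "read" values as well.

```sql
-- Current size of PostgreSQL's shared buffer cache.
SHOW shared_buffers;

-- Re-run the query with per-node buffer statistics.
-- Compare "shared hit" against "read" in a slow run and in a fast run.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "answered_questions".*
FROM "answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE "taken_tests"."user_id" = 1;

-- Cumulative cache hits versus reads for both tables and their indexes
-- since the statistics were last reset.
SELECT relname, heap_blks_hit, heap_blks_read, idx_blks_hit, idx_blks_read
FROM pg_statio_user_tables
WHERE relname IN ('answered_questions', 'taken_tests');
```

If the slow runs show mostly "read" and the fast runs mostly "shared hit" for the same plan, that would support the view that the difference is caching rather than the join condition.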
For more tips on PostgreSQL memory configuration, see: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
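Note: the VACUUM or ANALYZE commands are unlikely to help here, since both queries already use the same plan. Keep in mind, however, that because of PostgreSQL's transaction isolation mechanism (MVCC), it may have to read the underlying table rows after fetching results from the index, to verify that they are still visible to the current transaction. This can be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.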
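As a rough way to see how much of each table is currently marked all-visible, the system catalog can be queried. This is only a sketch, assuming the table names from the question; relallvisible is an estimate kept for the planner and is refreshed by VACUUM, ANALYZE, and a few DDL commands, so it may lag behind the real state of the visibility map.

```sql
-- Pages in each table versus pages marked all-visible in the visibility map.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname IN ('answered_questions', 'taken_tests');

-- A manual vacuum updates the visibility map; VERBOSE prints per-table details.
VACUUM (VERBOSE) answered_questions;
```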