Huge performance difference between sum(column_name), sum(1) and count(*) on a large dataset

Question

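Edit:
Since you suggested creating separate tables for the player and tournament names and replacing the strings with foreign keys, I did the following: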

SELECT DISTINCT tournament INTO tournaments FROM chess_data2;
ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY;


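I repeated this for namew and nameb, and then set out to replace the strings with foreign keys. This is where it got tricky: I could not get it done in a reasonable amount of time.

I tried the following two approaches. For the first one:
1) Dropped the existing indexes
2) Created separate indexes on namew, nameb and tournament
3) Ran a query that inserts the data I want into a new table: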

SELECT date, whiterank, blackrank, t_round, result,
(SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep,
(SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp,
(SELECT t.id FROM tournaments t WHERE c_d2.tournament = t.t_name) AS tournament
INTO final_chess FROM chess_data2 c_d2;


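Unfortunately that was very slow, so I went back to user Boris Shchegolev. In a comment he suggested adding a new column to the existing chess_data2 table and updating it in place. So I did: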

ALTER TABLE chess_data2 ADD COLUMN namew_id INTEGER;
UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew);


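I started these queries half an hour ago. The first one finished instantly, but the second one is taking forever.

What should I do now?

Original question: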

Database schema:
date DATE
namew TEXT
nameb TEXT
whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
result REAL
id BIGINT
chess_data2_pkey(id)
black_index(nameb, tournament, date)
chess_data2_pkey(id) UNIQUE
w_b_t_d_index(namew, nameb, tournament, date)
white_index(namew, tournament, date)

Problem:
The following statement performs quite well (about 60-70 seconds on a database with 3 million entries):

-- Number of points that the white player has so far accrued throughout the tournament
SELECT
(SELECT coalesce(SUM(result),0) FROM chess_data2 t2
WHERE (t1.namew = t2.namew) AND t1.tournament = t2.tournament
AND t1.date > t2.date AND t1.date < t2.date + 90)
+ (SELECT coalesce(SUM(1-result),0) FROM chess_data2 t2
WHERE (t1.namew = t2.nameb) AND t1.tournament = t2.tournament
AND t1.date > t2.date AND t1.date < t2.date + 90) AS result_in_t_w
FROM chess_data2 t1


Meanwhile, the following select (which has exactly the same WHERE clauses) takes forever to compute.

-- Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(count(*),0) FROM chess_data2 t2 WHERE (t1.namew = t2.namew) AND
t1.tournament = t2.tournament AND t1.date > t2.date AND t1.date < t2.date + 90)
+ (SELECT coalesce(count(*),0) FROM chess_data2 t2
WHERE (t1.namew = t2.nameb) AND t1.tournament = t2.tournament
AND t1.date > t2.date AND t1.date < t2.date + 90) AS games_t_w FROM chess_data2 t1


I tried a different approach (sum) and it did not get any better:

-- Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(sum(1),0) FROM chess_data2 t2 WHERE (t1.namew = t2.namew) AND
t1.tournament = t2.tournament AND t1.date > t2.date AND t1.date < t2.date + 90)
+ (SELECT coalesce(sum(1),0) FROM chess_data2 t2
WHERE (t1.namew = t2.nameb) AND t1.tournament = t2.tournament
AND t1.date > t2.date AND t1.date < t2.date + 90) AS games_t_w FROM chess_data2 t1


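Any idea what is going on here and how to fix it? I run these queries with Python 3.5 and psycopg2 from PyCharm. I will gladly provide any additional information, as this is a very important project for me.

EXPLAIN ANALYZE (for the last query):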

Seq Scan on chess_data2 t1  (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
          ->  Index Only Scan using white_index on chess_data2 t2  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185)
                Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
                Filter: (t1.date < (date + 90))
                Rows Removed by Filter: 1
                Heap Fetches: 5303160
  SubPlan 2
    ->  Aggregate  (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
          ->  Index Only Scan using black_index on chess_data2 t2_1  (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185)
                Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
                Filter: (t1.date < (date + 90))
                Rows Removed by Filter: 1
                Heap Fetches: 6009767
Planning time: 0.161 ms
Execution time: 81883.716 ms


Answer 1

Bor*_*lev 3

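The query performs poorly because of poor table design. The EXPLAIN output makes it clear that the database does use the indexes, but every indexed name field is a full TEXT value, so the indexes are huge.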

To fix this:

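create a table names
replace namew and nameb with namew_id and nameb_id, both referencing names
create a table tournaments
replace tournament with tournament_id, referencing tournaments
recreate black_index as (nameb_id, tournament_id, date)
recreate white_index as (namew_id, tournament_id, date)
drop w_b_t_d_index unless you use it in other queries
remove the useless coalesce from the count(*) queries (count(*) never returns NULL)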
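
The steps above can be sketched end to end on a toy table. This is only a minimal sketch under stated assumptions, not the asker's actual migration: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, the sample rows are made up, and the table name chess_norm is hypothetical. What it tries to show is filling the ID columns with one set-based JOIN rather than one correlated subquery per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for chess_data2 (made-up rows, column names taken from the question).
cur.execute("""
    CREATE TABLE chess_data2 (
        id INTEGER PRIMARY KEY, date TEXT, namew TEXT, nameb TEXT,
        tournament TEXT, result REAL)
""")
cur.executemany(
    "INSERT INTO chess_data2 (date, namew, nameb, tournament, result) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("2016-01-01", "Alice", "Bob", "Open A", 1.0),
        ("2016-01-02", "Bob", "Carol", "Open A", 0.5),
        ("2016-01-03", "Carol", "Alice", "Open B", 0.0),
    ],
)

# Lookup tables: one row per distinct player name / tournament name.
cur.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("""
    INSERT INTO names (name)
    SELECT namew FROM chess_data2 UNION SELECT nameb FROM chess_data2
""")
cur.execute("CREATE TABLE tournaments (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("INSERT INTO tournaments (name) SELECT DISTINCT tournament FROM chess_data2")

# Rebuild the games table with integer keys in a single pass using JOINs.
# chess_norm is a hypothetical name chosen for this sketch.
cur.execute("""
    CREATE TABLE chess_norm AS
    SELECT c.id, c.date, w.id AS namew_id, b.id AS nameb_id,
           t.id AS tournament_id, c.result
    FROM chess_data2 c
    JOIN names w ON w.name = c.namew
    JOIN names b ON b.name = c.nameb
    JOIN tournaments t ON t.name = c.tournament
""")

# Narrow integer indexes in place of the wide TEXT ones, created after the load.
cur.execute("CREATE INDEX white_index ON chess_norm (namew_id, tournament_id, date)")
cur.execute("CREATE INDEX black_index ON chess_norm (nameb_id, tournament_id, date)")

row_count = cur.execute("SELECT count(*) FROM chess_norm").fetchone()[0]
name_count = cur.execute("SELECT count(*) FROM names").fetchone()[0]
tournament_count = cur.execute("SELECT count(*) FROM tournaments").fetchone()[0]
missing_ids = cur.execute(
    "SELECT count(*) FROM chess_norm "
    "WHERE namew_id IS NULL OR nameb_id IS NULL OR tournament_id IS NULL"
).fetchone()[0]
print(row_count, name_count, tournament_count, missing_ids)
```

PostgreSQL also supports CREATE TABLE ... AS SELECT, so the same shape should carry over; building the indexes only after the new table is filled avoids maintaining them row by row during the load.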

Your query should then look like this:

SELECT
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE
            t1.namew_id = t2.namew_id AND
            t1.tournament_id = t2.tournament_id AND
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    )
    +
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE 
            t1.namew_id = t2.nameb_id AND
            t1.tournament_id = t2.tournament_id AND 
            t1.date > t2.date AND 
            t1.date < t2.date + 90
    ) AS games_t_w
FROM chess_data2 t1

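
To check what the rewritten query returns, and why the coalesce is unnecessary around count(*), here is a small runnable sketch. It assumes made-up rows that are already normalized, again uses Python's sqlite3 module in place of PostgreSQL, and stores date as an integer day number so that date + 90 works in SQLite (PostgreSQL handles that arithmetic natively on DATE columns).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up, already-normalized rows. `date` is an integer day number here,
# which is an assumption of this sketch, not the asker's schema.
cur.execute("""
    CREATE TABLE chess_data2 (
        id INTEGER PRIMARY KEY, date INTEGER,
        namew_id INTEGER, nameb_id INTEGER, tournament_id INTEGER)
""")
cur.executemany(
    "INSERT INTO chess_data2 (date, namew_id, nameb_id, tournament_id) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 1, 2, 1),    # game 1: player 1 (white) vs player 2
        (2, 3, 1, 1),    # game 2: player 3 (white) vs player 1
        (3, 1, 3, 1),    # game 3: player 1 (white) vs player 3
        (200, 1, 2, 1),  # game 4: same tournament id, outside the 90-day window
    ],
)

# Games the white player has already played (as white or as black)
# in the same tournament within the previous 90 days.
rows = cur.execute("""
    SELECT t1.id,
        (SELECT count(*) FROM chess_data2 t2
         WHERE t1.namew_id = t2.namew_id AND t1.tournament_id = t2.tournament_id
           AND t1.date > t2.date AND t1.date < t2.date + 90)
        +
        (SELECT count(*) FROM chess_data2 t2
         WHERE t1.namew_id = t2.nameb_id AND t1.tournament_id = t2.tournament_id
           AND t1.date > t2.date AND t1.date < t2.date + 90) AS games_t_w
    FROM chess_data2 t1
    ORDER BY t1.id
""").fetchall()
print(rows)

# count(*) over an empty set is 0, while sum(...) over an empty set is NULL (None).
empty_count, empty_sum = cur.execute(
    "SELECT count(*), sum(1) FROM chess_data2 WHERE 1 = 0"
).fetchone()
print(empty_count, empty_sum)
```

Only the sum(1) variant needs coalesce to turn the empty case into 0; with count(*) the wrapper does no work.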

Archived:	9 years, 5 months ago
Views:	167
Last activity:	9 years, 5 months ago