Inserting millions of rows into an SQLite database: Python is too slow

Luk*_*ord 1 python sqlite chess

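For my chess engine, I use statistics to choose the best moves. I collected them from millions of games. I am interested in the current move, the next move, and how many times the next move was played given the current move.

Using a Python dictionary and storing it with pickle made the file too large and hard to update with new games, so I decided to use SQLite.

I created a class MovesDatabase: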

import os
import sqlite3


class MovesDatabase:

    def __init__(self, work_dir):
        # Open (or create) moves.db inside work_dir.
        self.con = sqlite3.connect(os.path.join(work_dir, "moves.db"))
        self.con.execute('PRAGMA temp_store = MEMORY')
        self.con.execute('PRAGMA synchronous = NORMAL')
        self.con.execute('PRAGMA journal_mode = WAL')
        self.cur = self.con.cursor()

        # One row per (move, next) pair; count is how often the pair was seen.
        self.cur.execute("CREATE TABLE IF NOT EXISTS moves("
                         "move TEXT,"
                         "next TEXT,"
                         "count INTEGER DEFAULT 1);")

move and next represent the state of the board in string format: FEN. Examples:

  • rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
  • r1b1k1nr/p2p1pNp/n2B4/1p1NP2P/6P1/3P1Q2/P1P1K3/q5b1
  • 8/8/8/4p1K1/2k1P3/8/8/8 b
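
The insertion code in this question keeps only the first two space-separated fields of a full FEN string: the piece placement and the side to move. A minimal sketch of that reduction in plain Python follows; the helper name fen_to_state is made up for illustration and is not part of python-chess.

```python
def fen_to_state(fen: str) -> str:
    """Keep only the piece placement and side-to-move fields of a FEN string."""
    fields = fen.split(" ")
    return fields[0] + " " + fields[1]


# Full FEN of the standard starting position (six fields).
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
state = fen_to_state(start)
print(state)
```

Dropping the castling, en passant, and move-counter fields means positions that differ only in those details share one row in the table.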

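The method below takes a games file, extracts the moves, and inserts a row if the (move, next) pair is new, or updates it if the (move, next) pair already exists in the database: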

def insert_moves_from_file(self, file: str):
    print("Extracting moves to database from " + file)

    count = 0

    with open(file) as games_file:
        game = chess.pgn.read_game(games_file)

        while game is not None:
            batch = []
            board = game.board()
            state_one = board.fen().split(' ')[0] + ' ' + board.fen().split(' ')[1]

            for move in game.mainline_moves():
                board.push(move)
                fen = board.fen().split(' ')
                state_two = fen[0] + ' ' + fen[1]

                res = self.cur.execute("SELECT * FROM moves WHERE move=? AND next=?",
                                       (state_one, state_two))
                res = res.fetchall()

                if len(res) != 0:
                    self.cur.execute("UPDATE moves SET count=count+1 WHERE move=? AND next=?",
                                     (state_one, state_two))
                else:
                    batch.append((state_one, state_two))

                state_one = state_two

            self.cur.executemany("INSERT INTO moves(move, next) VALUES"
                                 "(?, ?)", batch)
            count += 1
            print("\r%d games were added to the database.." % count, end='')
            game = chess.pgn.read_game(games_file)

    self.con.commit()
    print("\n Finished!")

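The (move, next) pair is unique.

I tested with a file containing about 4 million (move, next) pairs. It starts inserting/updating at 3,000 rows/s, but by the time about 50K rows have been added it slows down to 100 rows/s and keeps dropping. I designed this method to process multiple game files, which is why I chose an SQL database in the first place.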

AKX*_*AKX 5

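It's not INSERT that is slow here.

Your move and next columns aren't indexed, so any SELECT or UPDATE involving those columns requires a full table scan.

If (move, next) is always unique, you'll want to add a UNIQUE index on it. That will also automatically make queries for a move/next pair faster (but not necessarily queries that filter on only one of those two columns).

To create that index on your existing table: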

CREATE UNIQUE INDEX ix_move_next ON moves (move, next);
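
One way to check the effect of the index is to ask SQLite for its query plan before and after creating it. The sketch below uses an in-memory database with the same table layout; the helper name plan is made up for illustration, and the exact wording of the plan text varies between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE moves(move TEXT, next TEXT, count INTEGER DEFAULT 1)")

query = "SELECT * FROM moves WHERE move=? AND next=?"


def plan(con, sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = con.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(con, query, ("a", "b"))
con.execute("CREATE UNIQUE INDEX ix_move_next ON moves (move, next)")
plan_after = plan(con, query, ("a", "b"))

print("without index:", plan_before)  # expected to report a table scan
print("with index:   ", plan_after)   # expected to report a search using ix_move_next
```

Without the index, every lookup in the question's loop walks the whole table, so the cost per move grows with the number of rows already stored.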

Finally, once you have that index in place, you can get rid of the whole SELECT/UPDATE thing with an upsert:

INSERT INTO moves (move, next) VALUES (?, ?) ON CONFLICT (move, next) DO UPDATE SET count = count + 1;
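
To see the upsert in isolation, here is a small sketch against an in-memory database. The state strings are placeholders rather than real FEN, and the ON CONFLICT clause assumes the SQLite library bundled with Python is version 3.24 or newer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE moves(move TEXT, next TEXT, count INTEGER DEFAULT 1)")
# The upsert needs a unique index (or constraint) on the conflict target.
con.execute("CREATE UNIQUE INDEX ix_move_next ON moves (move, next)")

upsert = (
    "INSERT INTO moves (move, next) VALUES (?, ?) "
    "ON CONFLICT (move, next) DO UPDATE SET count = count + 1"
)

# Placeholder states; the first pair appears twice in the same batch.
pairs = [("A w", "B b"), ("B b", "C w"), ("A w", "B b")]
con.executemany(upsert, pairs)
con.commit()

rows = con.execute("SELECT move, next, count FROM moves ORDER BY move").fetchall()
print(rows)
```

The pair that appears twice ends up as a single row with count 2, including when both occurrences are in the same executemany batch, so no separate SELECT is needed to decide between inserting and updating.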

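Here's a slight refactoring that achieves about 6200 moves/second on my machine. (It requires tqdm, a library for nice progress bars, and a pgns/ directory containing PGN files.)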

import glob
import sqlite3
import chess.pgn
import tqdm
from chess import WHITE


def board_to_state(board):
    # These were extracted from the implementation of `board.fen()`
    # so as to avoid doing extra work we don't need.
    bfen = board.board_fen(promoted=False)
    turn = ("w" if board.turn == WHITE else "b")
    return f'{bfen} {turn}'


def insert_game(cur, game):
    batch = []
    board = game.board()
    state_one = board_to_state(board)
    for move in game.mainline_moves():
        board.push(move)
        state_two = board_to_state(board)
        batch.append((state_one, state_two))
        state_one = state_two
    cur.executemany("INSERT INTO moves (move, next) VALUES (?, ?) ON CONFLICT (move, next) DO UPDATE SET count = count + 1", batch)
    n_moves = len(batch)
    return n_moves


def main():
    con = sqlite3.connect("moves.db")
    con.execute('PRAGMA temp_store = MEMORY')
    con.execute('PRAGMA synchronous = NORMAL')
    con.execute('PRAGMA journal_mode = WAL')
    con.execute('CREATE TABLE IF NOT EXISTS moves(move TEXT,next TEXT,count INTEGER DEFAULT 1);')
    con.execute('CREATE UNIQUE INDEX IF NOT EXISTS ix_move_next ON moves (move, next);')

    cur = con.cursor()

    for pgn_file in sorted(glob.glob("pgns/*.pgn")):
        with open(pgn_file) as games_file:
            n_games = 0
            with tqdm.tqdm(desc=pgn_file, unit="moves") as pbar:
                while (game := chess.pgn.read_game(games_file)):
                    n_moves = insert_game(cur, game)
                    n_games += 1
                    pbar.set_description(f"{pgn_file} ({n_games} games)", refresh=False)
                    pbar.update(n_moves)
            con.commit()


if __name__ == '__main__':
    main()
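
Once the table is populated, reading the statistics back for the engine could look like the sketch below. The function name most_played_next and the sample rows are invented for illustration; only the table and index definitions come from the code above.

```python
import sqlite3


def most_played_next(con, state):
    """Return (next, count) for the most frequent follow-up of state, or None."""
    return con.execute(
        "SELECT next, count FROM moves WHERE move = ? "
        "ORDER BY count DESC LIMIT 1",
        (state,),
    ).fetchone()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE moves(move TEXT, next TEXT, count INTEGER DEFAULT 1)")
con.execute("CREATE UNIQUE INDEX ix_move_next ON moves (move, next)")

# Invented sample data: two follow-ups for state "S w", one for "T w".
con.executemany(
    "INSERT INTO moves (move, next, count) VALUES (?, ?, ?)",
    [("S w", "A b", 3), ("S w", "B b", 7), ("T w", "C b", 1)],
)

best = most_played_next(con, "S w")
missing = most_played_next(con, "unseen state")
print(best, missing)
```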