Python MySQLdb SSCursor is slow compared to exporting to and importing from CSV files. Can it be sped up?

Lyc*_*erg 6 python mysql csv file-io mysql-python

As part of building a data warehouse, I have to query a source database table of about 75M rows.

What I want to do with the 75M rows is some processing, and then adding the result to another database. Now, that is quite a lot of data, and I have had success mainly with two approaches:

1) Exporting the query to a CSV file using MySQL's "SELECT ... INTO OUTFILE" feature and reading it with Python's fileinput module, and

2) Connecting to the MySQL database using MySQLdb's SSCursor (the default cursor pulls the whole result set into memory, killing the Python script) and fetching the results in chunks of about 10k rows (which is the fastest chunk size I have found).

The first approach is a "manual" SQL query (taking about 6 minutes), followed by a Python script that reads the CSV file and processes it. The reason I use fileinput to read the file is that fileinput doesn't load the whole file into memory from the start, and it works well with larger files. Just traversing the file (reading every line in the file and calling pass) takes about 80 seconds, i.e. 1M rows/s.
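The traversal benchmark described above can be reproduced with a short, self-contained sketch. The file here is a small generated stand-in (the real dump has ~75M rows), and the field layout is invented for illustration:

```python
import fileinput
import os
import tempfile
import time

# Write a small stand-in CSV (the real dump is far larger).
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
for i in range(100000):
    tmp.write('%d;%d;20130107;42\n' % (i // 10, i % 10))
tmp.close()

# Benchmark: iterate over every line, doing nothing else.
start = time.time()
count = 0
for line in fileinput.input([tmp.name]):
    count += 1
elapsed = time.time() - start

print('%d lines in %.2f s' % (count, elapsed))
os.unlink(tmp.name)
```

Because fileinput streams the file line by line, memory use stays flat regardless of file size.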

The second approach is a Python script executing the same query (which also takes about 6 minutes, or slightly longer), followed by a loop that fetches chunks of rows for as long as any are left in the SSCursor. Here, just reading the rows (fetching one chunk after another without doing anything else) takes about 15 minutes, or roughly 85k rows/s.
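The chunked fetchmany loop is the standard pattern for server-side cursors. Its shape can be illustrated with Python's built-in sqlite3 module standing in for the MySQLdb connection (table, data, and chunk size here are made up for the illustration):

```python
import sqlite3

CHUNK_SIZE = 3  # deliberately tiny, to force several loop iterations

# In-memory database standing in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t (id INTEGER, val INTEGER)')
cur.executemany('INSERT INTO t VALUES (?, ?)',
                [(i, i * i) for i in range(10)])

cur.execute('SELECT id, val FROM t ORDER BY id')

rows_seen = 0
chunk = cur.fetchmany(CHUNK_SIZE)
while chunk:  # an empty list signals that the cursor is exhausted
    for row in chunk:
        rows_seen += 1
    chunk = cur.fetchmany(CHUNK_SIZE)

print(rows_seen)
```

With a real SSCursor the rows stay on the server until each fetchmany call, which is what keeps memory bounded for a 75M-row result.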

The two numbers (rows/s) above are perhaps not really comparable, but when benchmarking the two approaches in my application, the first one takes about 20 minutes (of which about five is MySQL dumping into a CSV file), and the second one takes about 35 minutes (of which about five minutes is the query being executed). This means that dumping and reading to/from a CSV file is about twice as fast as using an SScursor directly.

This would be no problem if it did not restrict the portability of my system: a "SELECT ... INTO" statement requires MySQL to have write privileges, and I suspect it is not as safe as using cursors. On the other hand, 15 minutes (and growing, as the source database grows) is not really something I can spare on every build.

So, am I missing something? Is there any known reason for the SSCursor to be so much slower than dumping/reading to/from a CSV file, e.g. that fileinput is C-optimized while the SSCursor is not? Any ideas on how to proceed with this problem? Anything to test? I would believe that the SSCursor could be as fast as the first approach, but after reading all I can find about the matter, I'm stumped.

Now, to the code:

Not that I think the query is the problem (it's as fast as I can ask for and takes similar time in both approaches), but here it is for the sake of completeness:

    SELECT LT.SomeID, LT.weekID, W.monday, GREATEST(LT.attr1, LT.attr2)
    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
    ORDER BY LT.someID ASC, LT.weekID ASC;

The primary code in the first approach is something like this:

    import fileinput

    INPUT_PATH = 'path/to/csv/dump/dump.csv'
    event_list = []
    ID = None  # sentinel: never equal to a real ID

    for line in fileinput.input([INPUT_PATH]):
        # strip the trailing newline so the last field is clean
        split_line = line.rstrip('\n').split(';')
        if split_line[0] == ID:
            event_list.append(split_line[1:])
        else:
            if ID is not None:
                process_function(ID, event_list)
            event_list = [split_line[1:]]
            ID = split_line[0]

    # flush the final group
    process_function(ID, event_list)

The primary code in the second approach is:

    import MySQLdb
    ...opening connection, defining SScursor called ssc...
    CHUNK_SIZE = 100000

    query_stmt = """SELECT LT.SomeID, LT.weekID, W.monday,
                    GREATEST(LT.attr1, LT.attr2)
                    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
                    ORDER BY LT.someID ASC, LT.weekID ASC"""
    ssc.execute(query_stmt)

    event_list = []
    ID = None  # sentinel: never equal to a real ID

    data_chunk = ssc.fetchmany(CHUNK_SIZE)
    while data_chunk:
        for row in data_chunk:
            if row[0] == ID:
                event_list.append([row[1], row[2], row[3]])
            else:
                if ID is not None:
                    process_function(ID, event_list)
                event_list = [[row[1], row[2], row[3]]]
                ID = row[0]
        data_chunk = ssc.fetchmany(CHUNK_SIZE)

    # flush the final group
    process_function(ID, event_list)

At last, I'm on Ubuntu 13.04 with MySQL server 5.5.31. I use Python 2.7.4 with MySQLdb 1.2.3. Thank you for staying with me this long!

Air*_*Air 2

Using cProfile, I found that a lot of the time was spent implicitly constructing Decimal objects, since that was the numeric type returned from the SQL query into my Python script. In the first approach, the Decimal values were written to the CSV file as integers and then read back by the Python script. The CSV file I/O "flattened" the data, making the script faster. The two scripts now run at roughly the same speed (the second approach is still slightly slower).
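If Decimal construction is the bottleneck, the deterministic part of the fix can be sketched in plain Python: flattening a row of Decimals to plain ints once, before any per-value processing (the row contents here are invented to match the query's column layout):

```python
from decimal import Decimal

# A row as MySQLdb would return it when the columns are DECIMAL:
row = (Decimal('17'), Decimal('201'), Decimal('20130107'), Decimal('42'))

# Flatten each Decimal to a plain int once per row.
flat = tuple(int(v) for v in row)

print(flat)
```

With MySQLdb specifically, passing a converter mapping such as `{FIELD_TYPE.NEWDECIMAL: int}` (built from `MySQLdb.constants.FIELD_TYPE` and the defaults in `MySQLdb.converters`) via the `conv=` argument to `connect()` should avoid creating the Decimal objects at all, though that needs a live server and is not benchmarked here.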

I also converted the dates in the MySQL database to an integer type. My query is now:

SELECT LT.SomeID,
       LT.weekID,
       CAST(DATE_FORMAT(W.monday,'%Y%m%d') AS UNSIGNED),
       CAST(GREATEST(LT.attr1, LT.attr2) AS UNSIGNED)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC;

This almost eliminated the difference in processing time between the two approaches.

The lesson here is that when doing large queries, post-processing of data types really matters! Rewriting the query to reduce function calls in Python can improve the overall processing speed significantly.