Parallelizing pandas pyodbc SQL database calls

Question


use*_*988 12 python sql multithreading pyodbc pandas

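I am currently querying data into a DataFrame with the pandas.io.sql.read_sql() command. I would like to parallelize the calls, similar to what these folks advocate: (Embarrassingly parallel database calls with Python (PyData Paris 2015))

Something (very general) like: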

pools = [ThreadedConnectionPool(1,20,dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)

Is this possible?

Answer 1

Tri*_*eid 2

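Yes, this should work, with the caveat that you will need to change parallel_connection.py from the talk you cite. In that code there is a fetchall function that executes each of the cursors in parallel and then combines the results. This is the core of what you will change:

Old code: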

def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))

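The old fetchall relies on a `_do_parallel` helper that is not shown in this answer. As a rough sketch only, assuming it starts one thread per cursor and waits for all of them, it could look something like the following. The class name `ParallelCursorSketch` is hypothetical and this is not the repo's actual code:

```python
import threading
from itertools import chain


class ParallelCursorSketch:
    """Hypothetical stand-in for the talk's parallel cursor, not the repo's code."""

    def __init__(self, cursors):
        self.cursors = cursors

    def _do_parallel(self, target):
        # Assumption: one thread per cursor, each called as target(index, cursor).
        threads = [
            threading.Thread(target=target, args=(index, cursor))
            for index, cursor in enumerate(self.cursors)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()  # wait until every cursor has been drained

    def fetchall(self):
        results = [None] * len(self.cursors)

        def do_work(index, cursor):
            # Each thread writes only to its own slot, so no lock is needed.
            results[index] = cursor.fetchall()

        self._do_parallel(do_work)
        return list(chain(*results))
```

Note that an exception raised inside a worker thread is not re-raised here, so a failed cursor would leave `None` in its slot and break the final `chain`.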

New code:

def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  #  Store tuple of sql/conn instead of cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    return pd.concat(results)  # DataFrame.append was removed in pandas 2.0


Repo: https://github.com/godatadriven/ParallelConnection
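If you would rather not patch parallel_connection.py, the same idea can be sketched with only the standard library's thread pool. This is a minimal sketch, not the talk's implementation: `read_one` and `read_sql_parallel` are hypothetical names, and each job is assumed to be a `(sql, connect)` pair where `connect` is a zero-argument callable that opens a fresh DB-API connection, so that no connection is shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def read_one(job):
    """Run one query on its own connection and return the result as a DataFrame."""
    sql, connect = job  # connect: zero-argument callable returning a DB-API connection
    conn = connect()    # opened inside the worker thread, never shared
    try:
        return pd.read_sql(sql, conn)
    finally:
        conn.close()


def read_sql_parallel(jobs, max_workers=4):
    """Run all (sql, connect) jobs in a thread pool and stack the resulting frames."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(read_one, jobs))  # map keeps the order of jobs
    return pd.concat(frames, ignore_index=True)
```

With pyodbc, `connect` could be something like `lambda: pyodbc.connect(conn_str)`, where `conn_str` is your own connection string. Threads only help to the extent that the driver releases the GIL while waiting on the database, so it is worth timing this against the sequential version.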
