Pandas SQL chunksize

Question

Nit*_*mar 18 python sql-server chunks pandas

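This is more a question about understanding than about programming. I am new to Pandas and SQL. I am using pandas to read data from SQL with a specific chunksize. When I run a SQL query, e.g.

import pandas as pd

# con is an open database connection
df = pd.read_sql_query('select name, birthdate from table1', con, chunksize=1000)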

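What I do not understand is this: when I do not give a chunksize, the data is stored in memory and I can see the memory usage grow; when I do give a chunksize, the memory usage is not that high.

What I have is that this df now contains a number of arrays which I can access as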

for df_array in df:
    print(df_array.head(5))

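What I do not understand is whether the entire result of the SQL statement is kept in memory, i.e. df is an object carrying multiple arrays, or whether these are more like pointers to a temporary table created by the SQL query.

I would be very glad to gain some understanding of how this process actually works.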

Answer 1

小智 25

Let's consider two options and what happens in both cases:

chunksize is None (default value):
- pandas passes query to database
- database executes query
- pandas checks and sees that chunksize is None
- pandas tells database that it wants to receive all rows of the result table at once
- database returns all rows of the result table
- pandas stores the result table in memory and wraps it into a data frame
- now you can use the data frame
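
chunksize is not None:
- pandas passes query to database
- database executes query
- pandas checks and sees that chunksize has some value
- pandas creates a query iterator (usually a 'while True' loop which breaks when the database says that there is no more data left) and iterates over it each time you want the next chunk of the result table
- pandas tells database that it wants to receive chunksize rows
- database returns the next chunksize rows from the result table
- pandas stores the next chunksize rows in memory and wraps them into a data frame
- now you can use the data frame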

For more details you can see the pandas\io\sql.py module; it is well documented.
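
The chunked flow described in this answer can be sketched roughly as follows. This is not the actual pandas source, only an illustration of the 'while True' idea. It assumes an in-memory SQLite database filled with made-up rows, and `iter_frames` is a hypothetical helper name.

```python
import sqlite3

import pandas as pd


def iter_frames(con, query, chunksize):
    """Hypothetical helper: yield DataFrames of at most `chunksize` rows."""
    cursor = con.execute(query)  # the database executes the query
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(chunksize)  # ask for the next chunksize rows
        if not rows:  # no data left, so stop iterating
            break
        # Only this chunk is wrapped into a data frame
        yield pd.DataFrame(rows, columns=columns)


# Made-up sample data, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("name%d" % i, "2000-01-01") for i in range(2500)],
)

query = "select name, birthdate from table1"
chunk_lengths = [len(frame) for frame in iter_frames(con, query, 1000)]
print(chunk_lengths)
```

Nothing is fetched until the generator is iterated, and each step pulls at most chunksize rows from the cursor.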

Answer 2

jor*_*ris 20

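If you do not provide a chunksize, the full result of the query is put into a dataframe at once.

When you do provide a chunksize, the return value of read_sql_query is an iterator of multiple dataframes. This means that you can iterate through it like this: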

for df in result:
    print(df)

and in each step df is a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying
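
To see the difference between the two return values, here is a minimal sketch. It assumes an in-memory SQLite database with a made-up table1 of 25 rows; the table contents and the chunk size of 10 are illustrative only.

```python
import sqlite3
from collections.abc import Iterator

import pandas as pd

# Made-up sample data, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("name%d" % i, "2000-01-01") for i in range(25)],
)
con.commit()

query = "select name, birthdate from table1"

# Without chunksize: a single DataFrame holding the whole result
full = pd.read_sql_query(query, con)

# With chunksize: an iterator that yields one DataFrame per chunk
result = pd.read_sql_query(query, con, chunksize=10)
is_iterator = isinstance(result, Iterator)

chunks = list(result)
chunk_lengths = [len(df) for df in chunks]
print(type(full).__name__, is_iterator, chunk_lengths)
```

Concatenating the chunks with pd.concat(chunks, ignore_index=True) gives back the same rows as the unchunked call.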

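To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: execute and fetch.
First the query is executed (result = con.execute()), and then the data is fetched from this result set as a list of tuples (data = result.fetch()). When fetching, you can specify how many rows you want to fetch at once. This is what pandas does when you provide a chunksize.
However, many database drivers already put all the data into memory in the execute step, and not only when fetching the data. So in that regard, it should not matter much for memory. The exception is that the copying of the data into a DataFrame happens in separate steps while iterating when you use chunksize.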
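
The execute and fetch steps can be tried directly with a DB-API driver. The sketch below uses the standard-library sqlite3 module on made-up data. Note that result.fetch() above is shorthand; the DB-API methods are fetchone, fetchmany and fetchall. Whether a driver buffers the whole result at execute time depends on the driver and is not demonstrated here.

```python
import sqlite3

# Made-up sample data, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("name%d" % i, "2000-01-01") for i in range(7)],
)

# Step 1: execute the query; this returns a cursor over the result set
result = con.execute("select name, birthdate from table1")

# Step 2: fetch rows from the result set, here three at a time
first_batch = result.fetchmany(3)   # list of up to 3 tuples
second_batch = result.fetchmany(3)  # the next 3 tuples
remainder = result.fetchall()       # whatever is left

print(first_batch)
print(len(second_batch), len(remainder))
```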

Answer 3

小智 7

It is basically there to stop your server from running out of memory when you are making a large query.

Output to CSV

for i, chunk in enumerate(pd.read_sql_query(sql, con, chunksize=10000)):
    # mode='a' appends each chunk; write the header only for the first chunk
    chunk.to_csv(tablename + ".csv", mode='a', sep=',', encoding='utf-8', header=(i == 0))

Or using Parquet

count = 0
folder_path = 'path/to/output'

for chunk in pd.read_sql_query(sql , con, chunksize=10000):
    file_path = folder_path + '/part.%s.parquet' % (count)
    chunk.to_parquet(file_path, engine='pyarrow')
    count += 1

Archived:	10 years, 2 months ago
Viewed:	19525 times
Last activity:	7 years, 8 months ago