Yeh*_*ens 3 python async-await pandas
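I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).
I wrote the following code on the assumption that it would import the two DataFrames faster, but it seems to run slower: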
import timeit
import pandas as pd
import asyncio
train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3],'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12],'period': [2, 2, 2]})
train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')
async def run_async_train():
    return pd.read_csv('train.csv')
async def run_async_test():
    return pd.read_csv('test.csv')
async def run_train_test_asinc():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df
start_async = timeit.default_timer()
async_train,async_test=asyncio.run(run_train_test_asinc())
finish_async = timeit.default_timer()
time_to_run_async=finish_async-start_async
start = timeit.default_timer()
train=pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start
print(time_to_run_async<time_to_run_without_async)
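Why does the non-async version read the two DataFrames faster?
Just to be clear, I am actually going to read the data from BigQuery, so I would really like to use something like the code above to speed up the two requests (train and test).
Thanks in advance!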
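pd.read_csv is not an async method, so I don't believe you are actually getting any parallelism from it. You need to use an async file library such as aiofiles to read the files asynchronously into buffers, and then pass those buffers to pd.read_csv.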
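To illustrate the point, here is a minimal sketch that uses only the standard library. The two reader functions are hypothetical stand-ins: `time.sleep` plays the role of a blocking call such as `pd.read_csv`, and `asyncio.sleep` plays the role of a genuinely awaitable I/O operation. The delay value is arbitrary.

```python
import asyncio
import time

DELAY = 0.2  # seconds; arbitrary stand-in for I/O latency


async def blocking_read(name):
    # Declared async but never awaits: time.sleep holds the event loop,
    # much like a plain pd.read_csv call inside a coroutine would.
    time.sleep(DELAY)
    return name


async def awaiting_read(name):
    # Awaits, so the event loop can switch to the other task while waiting.
    await asyncio.sleep(DELAY)
    return name


async def timed_gather(read):
    # Run two reads through asyncio.gather and measure the wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(read("train"), read("test"))
    return results, time.perf_counter() - start


async def main():
    blocking = await timed_gather(blocking_read)
    awaiting = await timed_gather(awaiting_read)
    return blocking, awaiting


if __name__ == "__main__":
    (_, blocking_time), (_, awaiting_time) = asyncio.run(main())
    print(f"blocking coroutines: {blocking_time:.2f}s")
    print(f"awaiting coroutines: {awaiting_time:.2f}s")
```

The blocking pair should take roughly twice the delay because the second coroutine cannot start until the first returns, while the awaiting pair should overlap. On top of that, the async version in the question also pays for creating and tearing down an event loop, which is a plausible reason it measures slower on tiny files.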
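Note that most file systems are not truly asynchronous, so aiofiles is functionally a thread pool. It may still be faster than reading the files serially, though.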
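Since the work ends up in a thread pool either way, one alternative is to hand the blocking `pd.read_csv` calls to worker threads directly. The following is only a sketch under assumptions: it relies on `asyncio.to_thread` (Python 3.9+), the helper names `read_csvs_in_threads` and `write_sample_csvs` are made up for illustration, and the sample files are written to a temporary directory rather than the working directory. Whether it beats serial reads depends on file size and on how much of the work releases the GIL; for files this small the thread overhead probably dominates.

```python
import asyncio
import os
import tempfile

import pandas as pd


async def read_csvs_in_threads(paths):
    # asyncio.to_thread (Python 3.9+) runs each blocking pd.read_csv call
    # in the default thread pool and lets the event loop await the result.
    return await asyncio.gather(*(asyncio.to_thread(pd.read_csv, p) for p in paths))


def write_sample_csvs(directory):
    # Same toy frames as in the question, written without the index column.
    train = pd.DataFrame({"feature1": [1, 2, 3], "period": [1, 1, 1]})
    test = pd.DataFrame({"feature1": [1, 4, 12], "period": [2, 2, 2]})
    paths = [os.path.join(directory, "train.csv"), os.path.join(directory, "test.csv")]
    train.to_csv(paths[0], index=False)
    test.to_csv(paths[1], index=False)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        train, test = asyncio.run(read_csvs_in_threads(write_sample_csvs(tmp)))
    print(train)
    print(test)
```

The same pattern should apply to any other blocking loader, such as a synchronous database client call, by passing that callable to `asyncio.to_thread` instead of `pd.read_csv`.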
Here is my example of using aiohttp to fetch CSVs from URLs:
import io
import asyncio
import aiohttp
import pandas as pd

async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read the entire response text and convert it to a file-like object using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all futures at once.
        futures = [get_csv_async(client, url) for url in urls]
        # Then wait for all the futures to complete.
        return await asyncio.gather(*futures)

urls = [
    # Some random CSV URLs from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]

if '__main__' == __name__:
    # Run the event loop.
    # Can just do `csvs = asyncio.run(get_all_csvs_async(urls))` in Python 3.7+.
    csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))
    for csv in csvs:
        print(csv)