Async "read_csv" of several DataFrames in pandas: why isn't it faster?

Yeh*_*ens 3 python async-await pandas

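I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).

I wrote the following code, assuming it would import the two DataFrames faster, but it seems to run slower: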

import timeit

import pandas as pd
import asyncio

train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3],'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12],'period': [2, 2, 2]})

train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')


async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_async():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df

start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_async())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async

start = timeit.default_timer()
train=pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start

print(time_to_run_async<time_to_run_without_async)

Why does the non-async version read the two DataFrames faster?

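Just to be clear, I actually need to read the data from BigQuery, so I would really like to use code like the above to speed up the two requests (train and test).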

Thanks in advance!

Min*_*uel 5

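pd.read_csv is not an async method, so I don't believe you are actually getting any parallelism out of it. Wrapping it in an async def does not change that: the call still blocks the event loop until it returns. You need to use an async file library such as aiofiles to read the files asynchronously into buffers, and then pass those buffers to pd.read_csv(.).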
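To make the point above concrete, here is a minimal sketch using only the standard library. It assumes time.sleep is a fair stand-in for a blocking call like pd.read_csv, and asyncio.sleep a stand-in for a genuinely awaitable operation; the function names are made up for illustration.

```python
import asyncio
import time

events = []

async def blocking_task(name):
    # Stand-in for a blocking call such as pd.read_csv: there is no await
    # inside, so the event loop cannot switch to another coroutine here.
    events.append(f"{name} start")
    time.sleep(0.05)
    events.append(f"{name} end")

async def cooperative_task(name):
    # Stand-in for a truly async operation: the await hands control back
    # to the event loop, so the other coroutine can start in the meantime.
    events.append(f"{name} start")
    await asyncio.sleep(0.05)
    events.append(f"{name} end")

async def run_pair(task):
    events.clear()
    await asyncio.gather(task("train"), task("test"))
    return list(events)

blocking_order = asyncio.run(run_pair(blocking_task))
cooperative_order = asyncio.run(run_pair(cooperative_task))
print(blocking_order)
print(cooperative_order)
```

With the blocking version, "train" finishes completely before "test" even starts, which is the same serial behaviour as the code in the question, plus the overhead of starting an event loop. With the cooperative version, both tasks start before either one ends.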

Note that most file systems are not truly asynchronous, so aiofiles is functionally a thread pool. However, it may still be faster than reading serially.
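Since it comes down to a thread pool anyway, you can also hand the blocking calls to worker threads yourself. The following is only a sketch, assuming Python 3.9+ for asyncio.to_thread. The helpers load_all, read_rows and write_sample are hypothetical names, and read_rows uses the standard csv module as a stand-in so the example runs without pandas; in your case you would presumably pass pd.read_csv (or your blocking BigQuery call) as the loader instead.

```python
import asyncio
import csv
import os
import tempfile

def read_rows(path):
    # Hypothetical blocking loader (stand-in for pd.read_csv).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

async def load_all(paths, loader):
    # Run each blocking loader call in a worker thread and wait for all
    # of them; results come back in the same order as `paths`.
    return await asyncio.gather(*(asyncio.to_thread(loader, p) for p in paths))

def write_sample(path, rows):
    # Write a small CSV with the same columns as in the question.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature1", "period"])
        writer.writerows(rows)

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.csv")
test_path = os.path.join(tmp_dir, "test.csv")
write_sample(train_path, [(1, 1), (2, 1), (3, 1)])
write_sample(test_path, [(1, 2), (4, 2), (12, 2)])

train_rows, test_rows = asyncio.run(load_all([train_path, test_path], read_rows))
print(train_rows[0], test_rows[0])
```

Whether this beats the serial version depends on how much of the time is spent waiting on I/O rather than parsing. For tiny local files like the ones in the question, the thread and event-loop overhead will likely dominate; for slow network requests the gain should be more noticeable.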


Here is an example of mine that uses aiohttp to fetch CSVs from URLs:

import io
import asyncio

import aiohttp
import pandas as pd

async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read the entire response text and convert it to a file-like object using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all the coroutines at once (nothing runs yet).
        coros = [get_csv_async(client, url) for url in urls]
        # Then run them concurrently and wait for all of them to complete.
        return await asyncio.gather(*coros)

urls = [
    # Some random CSV urls from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]

if '__main__' == __name__:
    # Run the event loop (asyncio.run requires Python 3.7+).
    csvs = asyncio.run(get_all_csvs_async(urls))

    for csv in csvs:
        print(csv)