Why is parsing a CSV file with long lines so slow with numpy/pandas?

dsh*_*erd 5 python csv parsing numpy pandas

I am trying to efficiently parse a CSV file with around 20,000 entries per line (and a few thousand lines) into a numpy array (or a list of arrays, or anything similar, really). I found a number of other questions, along with this blog post, which suggest that pandas's csv parser is extremely fast. However, I have now benchmarked pandas, numpy, and some pure-python approaches, and it appears that trivial pure-python string splitting + list comprehension beats everything else by quite a wide margin.

  • What is going on here?

  • Is there a more efficient csv parser I should be using?

  • Would it help if I changed the format of the input data?

Here is the source code I am benchmarking against (the `sum()` is just to make sure any lazy iterators are forced to evaluate everything):

#! /usr/bin/env python3

import sys

import time
import gc

import numpy as np
from pandas.io.parsers import read_csv
import csv

def python_iterator_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(float(x) for x in all_data))


def python_list_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum([float(x) for x in all_data]))


def python_array_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(np.array([float(x) for x in all_data])))


def numpy_fromstring():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            print(sum(np.fromstring(line, sep = ",")))


def numpy_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.loadtxt(f, delimiter = ",", dtype = np.float64, ndmin = 2):
            print(sum(row))


def csv_loader(csvfile):
    return read_csv(csvfile,
                      header = None,
                      engine = "c",
                      na_filter = False,
                      quoting = csv.QUOTE_NONE,
                      index_col = False,
                      sep = ",")

def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.asarray(csv_loader(f).values, dtype = np.float64):
            print(sum(row))


def pandas_csv_2():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))


def simple_time(func, repeats = 3):
    gc.disable()

    for i in range(0, repeats):
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        print(func, end - start, file = sys.stderr)
        gc.collect()

    gc.enable()
    return


if __name__ == "__main__":

    simple_time(python_iterator_csv)
    simple_time(python_list_csv)
    simple_time(python_array_csv)
    simple_time(numpy_csv)
    simple_time(pandas_csv)
    simple_time(numpy_fromstring)

    simple_time(pandas_csv_2)

The output (to stderr) is:

<function python_iterator_csv at 0x7f22302b1378> 19.754893831999652
<function python_iterator_csv at 0x7f22302b1378> 19.62786615600271
<function python_iterator_csv at 0x7f22302b1378> 19.66641107099713

<function python_list_csv at 0x7f22302b1ae8> 18.761991592000413
<function python_list_csv at 0x7f22302b1ae8> 18.722911622000538
<function python_list_csv at 0x7f22302b1ae8> 19.00348913199923

<function python_array_csv at 0x7f222baffa60> 41.8681991630001
<function python_array_csv at 0x7f222baffa60> 42.141840383999806
<function python_array_csv at 0x7f222baffa60> 41.86879085799956

<function numpy_csv at 0x7f222ba5cc80> 47.957625758001086
<function numpy_csv at 0x7f222ba5cc80> 47.245571732000826
<function numpy_csv at 0x7f222ba5cc80> 47.25457685799847

<function pandas_csv at 0x7f2228572620> 43.39656048499819
<function pandas_csv at 0x7f2228572620> 43.5016079220004
<function pandas_csv at 0x7f2228572620> 43.567352316000324

<function numpy_fromstring at 0x7f593ed3cc80> 32.490607361
<function numpy_fromstring at 0x7f593ed3cc80> 32.421125410997774
<function numpy_fromstring at 0x7f593ed3cc80> 32.37903898300283

<function pandas_csv_2 at 0x7f846d1aa730> 24.903284349999012
<function pandas_csv_2 at 0x7f846d1aa730> 25.498485038999206
<function pandas_csv_2 at 0x7f846d1aa730> 25.03262125800029

Judging from the blog post linked above, pandas can import a csv matrix of random doubles at a data rate of 145/1.279502 = 113 MB/s. My file is 814 MB, so pandas is only managing ~19 MB/s for me!
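The arithmetic behind those two rates, using the figures quoted above (file sizes in MB, the ~43 s figure taken from the `pandas_csv` timings):

```python
# Throughput claimed by the blog post: 145 MB parsed in 1.279502 s
blog_rate = 145 / 1.279502   # ~113 MB/s
# Throughput observed here: an 814 MB file in ~43.4 s (pandas_csv above)
my_rate = 814 / 43.4         # ~19 MB/s
print(round(blog_rate), round(my_rate))  # 113 19
```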

Edit: as @ASGM pointed out, this wasn't really fair to pandas, since it isn't designed for row-wise iteration. I have included the suggested improvement in the benchmark, but it is still slower than the pure-python approaches. (Also: I had profiled similar code before simplifying it down to this benchmark, and the parsing always dominated the time taken.)

edit2: best of three runs, without the sum:

python_list_csv    17.8
python_array_csv   23.0
numpy_csv          28.6
numpy_fromstring   13.3
pandas_csv_2       24.2

So without the sum, `numpy.fromstring` beats pure python by a little (I think `fromstring` is written in C, so this makes sense).
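A related trick, not benchmarked above (a sketch; it assumes the file contains nothing but comma-separated floats and that every row has the same number of columns), is to split the whole file in one pass instead of line by line, so most of the per-line Python overhead disappears:

```python
import numpy as np

def load_matrix(path, ncols):
    # Read the entire file once, turn newlines into commas so a single
    # split yields one flat list of tokens, then let np.array do the
    # string-to-float conversion in bulk before reshaping.
    with open(path) as f:
        flat = f.read().replace("\n", ",").rstrip(",").split(",")
    return np.array(flat, dtype=np.float64).reshape(-1, ncols)
```

The per-line float conversion is the same as in `python_list_csv` above; the only change is that splitting and array construction each happen once for the whole file.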

EDIT3:

I ran some experiments with C/C++ float-parsing code here, and it looks like I was probably expecting too much of pandas/numpy. Most of the robust parsers listed there take 10+ seconds just to parse this many floats. The only parser that roundly beats `numpy.fromstring` is boost's `spirit::qi`, which is C++ and so unlikely to make it into any python library.

[More precise results: `spirit::qi` ~3 s, `lexical_cast` ~7 s, `atof` and `strtod` ~10 s, `sscanf` ~18 s, while `stringstream` and `stringstream reused` are incredibly slow at ~50 s and ~28 s.]

ali*_*i_m 10

Does your CSV file contain column headers? If not, then explicitly passing `header=None` to `pandas.read_csv` can give a slight performance boost for the Python parsing engine (but not for the C engine):

In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')

In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop

In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop

In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop

In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop

Update

If there are no missing or invalid values, then you can do a bit better by passing `na_filter=False` (only valid for the C engine):

In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop

In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop

There may also be a small gain from explicitly specifying the `dtype`:

In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop

Update 2

Following up on @morningsun's comment, setting `low_memory=False` squeezes out a bit more speed:

In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop

In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop

For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).
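Putting all of those options together, a combined fast-load recipe might look like this (a sketch, assuming a clean file of pure floats with no header and no missing values, since every option below depends on that):

```python
import numpy as np
import pandas as pd

def fast_load(path):
    # Each option here is one of the tweaks benchmarked above; each only
    # helps because the file is known to be an all-float matrix.
    df = pd.read_csv(path,
                     sep=",",
                     header=None,        # no header row to sniff
                     engine="c",         # the fast C tokenizer
                     na_filter=False,    # skip NaN/NA detection
                     dtype=np.float64,   # no dtype inference
                     low_memory=False)   # parse in one chunk
    return df.values  # a plain (nrows, ncols) float64 ndarray
```

Calling `fast_load("test.csv")` on the matrix generated with `np.savetxt` above returns the data directly as a numpy array, so no separate conversion step is needed afterwards.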