dsh*_*erd 5 python csv parsing numpy pandas
I am trying to efficiently parse a csv file with around 20,000 entries per line (and a few thousand lines) into a numpy array (or a list of arrays, or anything similar really). I found a number of other questions, along with this blog post, which suggest that pandas's csv parser is extremely fast. However, I have benchmarked pandas, numpy and some pure python approaches, and it appears that trivial pure python string splitting + a list comprehension beats everything else by a rather large margin.
What is going on here?
Is there a more efficient csv parser that I am missing?
Would it help if I changed the format of the input data?
Here is the source code I am benchmarking against (the sum() is just to make sure any lazy iterators are forced to evaluate everything):
#! /usr/bin/env python3
import sys
import time
import gc
import numpy as np
from pandas.io.parsers import read_csv
import csv

def python_iterator_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(float(x) for x in all_data))

def python_list_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum([float(x) for x in all_data]))

def python_array_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(np.array([float(x) for x in all_data])))

def numpy_fromstring():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            print(sum(np.fromstring(line, sep = ",")))

def numpy_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.loadtxt(f, delimiter = ",", dtype = np.float64, ndmin = 2):
            print(sum(row))

def csv_loader(csvfile):
    return read_csv(csvfile,
                    header = None,
                    engine = "c",
                    na_filter = False,
                    quoting = csv.QUOTE_NONE,
                    index_col = False,
                    sep = ",")

def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.asarray(csv_loader(f).values, dtype = np.float64):
            print(sum(row))

def pandas_csv_2():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))

def simple_time(func, repeats = 3):
    gc.disable()
    for i in range(0, repeats):
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        print(func, end - start, file = sys.stderr)
        gc.collect()
    gc.enable()
    return

if __name__ == "__main__":
    simple_time(python_iterator_csv)
    simple_time(python_list_csv)
    simple_time(python_array_csv)
    simple_time(numpy_csv)
    simple_time(pandas_csv)
    simple_time(numpy_fromstring)
    simple_time(pandas_csv_2)
The output (to stderr) is:
<function python_iterator_csv at 0x7f22302b1378> 19.754893831999652
<function python_iterator_csv at 0x7f22302b1378> 19.62786615600271
<function python_iterator_csv at 0x7f22302b1378> 19.66641107099713
<function python_list_csv at 0x7f22302b1ae8> 18.761991592000413
<function python_list_csv at 0x7f22302b1ae8> 18.722911622000538
<function python_list_csv at 0x7f22302b1ae8> 19.00348913199923
<function python_array_csv at 0x7f222baffa60> 41.8681991630001
<function python_array_csv at 0x7f222baffa60> 42.141840383999806
<function python_array_csv at 0x7f222baffa60> 41.86879085799956
<function numpy_csv at 0x7f222ba5cc80> 47.957625758001086
<function numpy_csv at 0x7f222ba5cc80> 47.245571732000826
<function numpy_csv at 0x7f222ba5cc80> 47.25457685799847
<function pandas_csv at 0x7f2228572620> 43.39656048499819
<function pandas_csv at 0x7f2228572620> 43.5016079220004
<function pandas_csv at 0x7f2228572620> 43.567352316000324
<function numpy_fromstring at 0x7f593ed3cc80> 32.490607361
<function numpy_fromstring at 0x7f593ed3cc80> 32.421125410997774
<function numpy_fromstring at 0x7f593ed3cc80> 32.37903898300283
<function pandas_csv_2 at 0x7f846d1aa730> 24.903284349999012
<function pandas_csv_2 at 0x7f846d1aa730> 25.498485038999206
<function pandas_csv_2 at 0x7f846d1aa730> 25.03262125800029
From the blog post linked above, it seems that pandas can import a csv matrix of random doubles at a data rate of 145/1.279502 = 113 MB/s. My file is 814 MB, so pandas is only managing ~19 MB/s for me!
edit: As @ASGM pointed out, this wasn't really fair to pandas, since it isn't designed for row-wise iteration. I have included the suggested improvement in the benchmark, but it is still slower than the pure python approaches. (Also: I profiled similar code, before simplifying it down to this benchmark, and the parsing always dominated the time taken.)
edit2: best of three times without the sum:
python_list_csv 17.8
python_array_csv 23.0
numpy_csv 28.6
numpy_fromstring 13.3
pandas_csv_2 24.2
So without the sum, numpy.fromstring beats pure python by a bit (I think fromstring is written in C, so that makes sense).
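Since the per-line splitting is where the pure python approach spends most of its time, another variant worth trying is one split over the entire file followed by a reshape. This is only a sketch under the assumption that every row has the same number of columns; parse_whole_file and n_cols are names made up for illustration, not part of the benchmark above:

```python
import os
import tempfile
import numpy as np

def parse_whole_file(path, n_cols):
    """Parse a uniform, all-numeric CSV with a single split over the whole file."""
    with open(path) as f:
        text = f.read()
    # One replace + split avoids per-line Python overhead; the `if x`
    # filters out the empty string produced by a trailing newline.
    values = [float(x) for x in text.replace("\n", ",").split(",") if x]
    return np.asarray(values, dtype=np.float64).reshape(-1, n_cols)

# Small round-trip demo on a temporary file
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
data = np.arange(12, dtype=np.float64).reshape(3, 4)
np.savetxt(path, data, delimiter=",")
parsed = parse_whole_file(path, n_cols=4)
print(np.allclose(parsed, data))  # → True
```

Whether this actually beats the line-by-line version will depend on file size and available memory, since it holds the whole text in RAM at once.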
edit3:
I ran some experiments with C/C++ float parsing code here, and it looks like I was probably expecting too much from pandas/numpy. Most of the robust parsers listed there take 10+ seconds just to parse this many floats. The only parser that soundly beats numpy.fromstring is boost's spirit::qi, which is C++ and so unlikely to make it into any python libraries.
[More precise results: spirit::qi ~3s, lexical_cast ~7s, atof and strtod ~10s, sscanf ~18s, while stringstream and stringstream reused are incredibly slow at 50s and 28s.]
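On the question of changing the input format: if the producer of the data can be changed, writing the matrix in numpy's binary .npy format sidesteps float parsing entirely, since the file already contains raw IEEE-754 doubles. A minimal sketch (file names here are made up for illustration):

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")
npy_path = os.path.join(tmpdir, "data.npy")

data = np.random.randn(100, 200)

# Text round-trip: every value must be parsed back from decimal digits
np.savetxt(csv_path, data, delimiter=",")
from_csv = np.loadtxt(csv_path, delimiter=",")

# Binary round-trip: raw doubles on disk, no float parsing at all
np.save(npy_path, data)
from_npy = np.load(npy_path)

print(np.allclose(from_csv, data), np.array_equal(from_npy, data))
```

For an 814 MB matrix the binary load should run at close to disk speed, and the file is smaller than the text form as well.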
ali*_*i_m 10
Does your CSV file contain column headers? If not, then explicitly passing header=None to pandas.read_csv can give a slight performance improvement for the Python parsing engine (but not for the C engine):
In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')
In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop
In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop
In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop
In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop
If there are no missing or invalid values, then you can do a bit better by passing na_filter=False (only valid for the C engine):
In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop
In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop
There may also be small gains to be had by specifying the dtype explicitly:
In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop
Following up on @morningsun's comment, setting low_memory=False squeezes out a bit more speed:
In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop
In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop
For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).
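Putting the options above together, a loader that goes straight to a float64 array might look like the sketch below (the small demo file is generated on the fly; how much low_memory=False helps will depend on your pandas version):

```python
import os
import tempfile
import numpy as np
import pandas as pd

def load_float_csv(path):
    """Read a headerless, all-numeric CSV into a float64 numpy array."""
    df = pd.read_csv(path,
                     sep=",",
                     header=None,        # file has no column headers
                     na_filter=False,    # no missing values to detect
                     dtype=np.float64,   # skip dtype inference
                     low_memory=False)   # parse in one chunk
    return df.values

# Small demo
path = os.path.join(tempfile.mkdtemp(), "test.csv")
data = np.random.randn(50, 20)
np.savetxt(path, data, delimiter=",")
arr = load_float_csv(path)
print(arr.shape)  # → (50, 20)
```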