Yux*_*ang 29 python csv matlab numpy
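I am posting this question because I would like to know whether I did something terribly wrong to get this result.
I have a medium-sized csv file and I tried to load it with numpy. For illustration, I generated the file with python: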
import timeit
import numpy as np
my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
Then I tried two methods: numpy.genfromtxt and numpy.loadtxt
setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""
t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
The result shows t1 = 32.159652940464184 and t2 = 52.00093725634724.
However, when I tried the same thing in matlab:
tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc
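The result shows: Elapsed time is 3.196465 seconds.
I understand that there may be some difference in loading speed, but:
Any input would be appreciated. Thanks a lot in advance!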
DSM*_*DSM 44
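Yes, reading csv files into numpy is quite slow. There is a lot of pure Python along the code path. These days, even when I am otherwise using pure numpy, I still use pandas for IO: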
>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
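One caveat when adapting the `read_csv` call above: by default pandas treats the first line of the file as a header row, so on a header-less file such as `test.csv` the first data row would be consumed as column names. The following is only a minimal sketch, assuming a header-less, all-numeric file; the helper name `load_csv_as_array` and the small temporary demo file are illustrative and not part of the original answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_csv_as_array(path, delimiter=","):
    """Read a header-less numeric CSV into a 2-D float array (hypothetical helper)."""
    # header=None keeps the first line as data instead of using it as column names.
    frame = pd.read_csv(path, delimiter=delimiter, header=None, dtype=np.float64)
    return frame.to_numpy()


# Small self-contained demo on a temporary file (assumed data, not the 1.5M-row file).
demo = np.arange(12, dtype=np.float64).reshape(4, 3)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    np.savetxt(demo_path, demo, delimiter=",", fmt="%.2f")
    loaded = load_csv_as_array(demo_path)

print(loaded.shape)
```

With `header=None` the returned array has as many rows as the file has lines, which is what `np.loadtxt` and `np.genfromtxt` would return for the same file.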
Or, in a simple enough case like this one, you could use something like what Joe Kington wrote:
>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
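The definition of `iter_loadtxt` is not reproduced in this answer. As a rough illustration of the general idea (stream the values through a generator into `numpy.fromiter`, then reshape), here is a minimal sketch. The name `iter_loadtxt_sketch` is hypothetical, and the code assumes a rectangular, all-numeric file with no blank lines; it is not Joe Kington's exact code.

```python
import os
import tempfile

import numpy as np


def iter_loadtxt_sketch(path, delimiter=",", skiprows=0, dtype=float):
    """Load a rectangular numeric text file via a generator and np.fromiter."""
    row_length = None

    def iter_values():
        nonlocal row_length
        with open(path, "r") as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                if row_length is None:
                    # Remember the column count from the first data row.
                    row_length = len(fields)
                for item in fields:
                    yield dtype(item)

    # fromiter consumes the generator into a flat 1-D array.
    flat = np.fromiter(iter_values(), dtype=dtype)
    if row_length is None:  # no data rows at all
        return flat.reshape(0, 0)
    return flat.reshape(-1, row_length)


# Small demo on a temporary file (assumed data for illustration).
demo = np.array([[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]])
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    np.savetxt(demo_path, demo, delimiter=",", fmt="%.2f")
    loaded = iter_loadtxt_sketch(demo_path)

print(loaded.shape)
```

The sketch avoids building an intermediate list of rows, which is the main reason this style of loader tends to be lighter than the general-purpose numpy text readers.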
There is also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:
>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s
Nic*_*mer 11
I have performance-tested the suggested solutions with perfplot (a small project of mine) and found that
pandas.read_csv(filename)
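is indeed the fastest solution (once more than 2000 entries are read; below that, everything is in the millisecond range). It outperforms the numpy variants by a factor of about 10. (numpy.fromfile is included only for comparison; it cannot read actual csv files.)
Code to reproduce the plot: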
import numpy
import pandas
import perfplot

numpy.random.seed(0)
filename = "a.txt"


def setup(n):
    a = numpy.random.rand(n)
    numpy.savetxt(filename, a)
    return None


def numpy_genfromtxt(data):
    return numpy.genfromtxt(filename)


def numpy_loadtxt(data):
    return numpy.loadtxt(filename)


def numpy_fromfile(data):
    out = numpy.fromfile(filename, sep=" ")
    return out


def pandas_readcsv(data):
    return pandas.read_csv(filename, header=None).values.flatten()


def kington(data):
    delimiter = " "
    skiprows = 0
    dtype = float

    def iter_func():
        with open(filename, "r") as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        kington.rowlength = len(line)

    data = numpy.fromiter(iter_func(), dtype=dtype).flatten()
    return data


b = perfplot.bench(
    setup=setup,
    kernels=[numpy_genfromtxt, numpy_loadtxt, numpy_fromfile, pandas_readcsv, kington],
    n_range=[2 ** k for k in range(23)],
)
b.save("out.png")
If you just want to save and read back a numpy array, it is much better to save it as a binary or compressed binary file, depending on its size:
import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
np.save('./testy', my_data)
np.savez('./testz', my_data)
del my_data
setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.load('./testy.npy')
"""
stmt3 = """\
my_data = np.load('./testz.npz')['arr_0']
"""
t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
t3 = timeit.timeit(stmt=stmt3, setup=setup_stmt, number=3)
Results (total seconds for 3 runs, in the order t1, t2, t3):
genfromtxt 39.717250824
save 0.0667860507965
savez 0.268463134766
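Note that `np.savez` writes an uncompressed `.npz` archive; `np.savez_compressed` is the variant that actually compresses. The round trip for both binary formats can be sketched as follows. This is only an illustration under assumed file names (`data.npy`, `data.npz`) and a small random array, not a re-run of the timings above.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 3)) * 10  # small stand-in for the 1.5M-row array

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "data.npy")
    npz_path = os.path.join(tmp, "data.npz")

    # Plain binary: fastest to write and read, no compression.
    np.save(npy_path, data)
    # Compressed archive: smaller on disk, slower to read back.
    np.savez_compressed(npz_path, data=data)

    from_npy = np.load(npy_path)
    # An .npz file loads lazily, so pull the array out before closing it.
    with np.load(npz_path) as archive:
        from_npz = archive["data"]

# Unlike the '%.2f' text format, the binary formats preserve full precision.
round_trip_ok = np.array_equal(data, from_npy) and np.array_equal(data, from_npz)
print(round_trip_ok)
```

Passing the array as a keyword (`data=data`) gives it a readable key in the archive instead of the default `arr_0` used in the snippet above.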