Owe*_*wen 62 python dictionary pandas
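Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a Pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.
The question: is the lesson here simply that dictionaries are the better way to look up values? Yes, I get that this is precisely what they were made for. But I wonder whether there is something I am missing about DataFrame lookup performance.
I realize this question is more "musing" than "asking", but I will accept an answer that provides insight or perspective on this. Thanks.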
import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
'''

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']

# Time each lookup: best of 3 repeats of 100000 calls, in seconds.
for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
value = dictionary[5][5]
0.130625009537
value = df.loc[5, 5]
19.4681699276
value = df.iloc[5, 5]
17.2575249672
unu*_*tbu 88
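A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, and so on. But if you need to go a mile, the car wins.
For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need or want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.
Now, to be more concrete: a dict is good for accessing columns, but it is not so convenient for accessing rows.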
import timeit

setup = '''
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000]))
dictionary = df.to_dict()
'''

# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]']
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]']

# Time retrieving row 5: best of 3 repeats of 100000 calls, in seconds.
for func in f:
    print(func)
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))
yields
value = [val[5] for col,val in dictionary.iteritems()]
25.5416321754
value = df.loc[5]
5.68071913719
value = df.iloc[5]
4.56006002426
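So the dict of lists is 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)
This is just one example of when a dict of lists would be less convenient or slower than a DataFrame.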
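To make the row case concrete, here is a minimal sketch (not part of the benchmark above) that checks both approaches return the same row. It assumes the default `df.to_dict()` layout, which is a dict of per-column dicts rather than literally a dict of lists; the names `row_from_dict` and `row_from_df` are just for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the benchmark above: 10 rows, 1000 columns.
df = pd.DataFrame(np.arange(10 * 1000, dtype=float).reshape(10, 1000))
dictionary = df.to_dict()  # {column label: {row label: value}}

# Row 5 from the dict: one Python-level lookup per column (1000 lookups).
row_from_dict = [col_values[5] for col_values in dictionary.values()]

# Row 5 from the DataFrame: a single indexing call returning a Series.
row_from_df = df.iloc[5]

# Both approaches produce the same values in the same column order.
same = row_from_dict == row_from_df.tolist()
print(same)
```

The dict version touches every column in a Python loop, which is why its cost grows with the number of columns.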
Another example would be when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use
df.loc['2000-1-1':'2000-3-31']
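If you were to use a dict of lists, there is no easy analog. And the Python loops you would need to select the right rows would again be terribly slow compared to the DataFrame.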
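As a rough illustration of that contrast, the sketch below builds one year of daily data and selects the first quarter both ways. The column name `value`, the `data` dict layout, and the helper variables are assumptions made up for this example, not anything from the original benchmark.

```python
import pandas as pd

# One value per day for the year 2000 (hypothetical sample data).
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
df = pd.DataFrame({"value": range(len(dates))}, index=dates)

# DataFrame: label-based slicing on the DatetimeIndex, both endpoints inclusive.
selected_df = df.loc["2000-1-1":"2000-3-31"]

# Dict of lists: find the matching row positions with an explicit Python
# loop, then pick those positions out of every column.
data = {"date": list(dates), "value": list(range(len(dates)))}
start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2000-03-31")
keep = [i for i, d in enumerate(data["date"]) if start <= d <= end]
selected_dict = {name: [column[i] for i in keep] for name, column in data.items()}

print(len(selected_df), len(selected_dict["value"]))
```

Both selections contain the same rows, but the dict version needs a pass over every date plus a second pass per column.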
joo*_*oon 17
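It seems the performance difference is much smaller now (0.21.1; I forgot which version of Pandas was used in the original example). Not only has the gap between dictionary access and .loc shrunk (from about 335 times slower to about 126 times slower), but loc (iloc) is now less than two times slower than at (iat).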
In [1]: import numpy, pandas
   ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
   ...: dictionary = df.to_dict()
   ...:
In [2]: %timeit value = dictionary[5][5]
85.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: %timeit value = df.loc[5, 5]
10.8 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit value = df.at[5, 5]
6.87 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit value = df.iloc[5, 5]
14.9 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit value = df.iat[5, 5]
9.89 µs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: print(pandas.__version__)
0.21.1
---- Original answer below ----
+1 for using at or iat for scalar operations. Example benchmark:
In [1]: import numpy, pandas
...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
...: dictionary = df.to_dict()
In [2]: %timeit value = dictionary[5][5]
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 310 ns per loop
In [4]: %timeit value = df.loc[5, 5]
10000 loops, best of 3: 104 µs per loop
In [5]: %timeit value = df.at[5, 5]
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.26 µs per loop
In [6]: %timeit value = df.iloc[5, 5]
10000 loops, best of 3: 98.8 µs per loop
In [7]: %timeit value = df.iat[5, 5]
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.58 µs per loop
It seems that using at (iat) is about 10 times faster than loc (iloc).
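For anyone who wants to rerun the comparison on their own pandas version outside IPython, here is a minimal Python 3 sketch using timeit. The call count of 10000 and the `results` dict are my own choices for illustration; the numbers will differ by machine and version, so no expected output is shown.

```python
import timeit

setup = """
import numpy, pandas
df = pandas.DataFrame(numpy.zeros(shape=[10, 10]))
dictionary = df.to_dict()
"""

statements = [
    "value = dictionary[5][5]",
    "value = df.loc[5, 5]",
    "value = df.at[5, 5]",
    "value = df.iloc[5, 5]",
    "value = df.iat[5, 5]",
]

number = 10000  # calls per timing run (assumed; raise for more stable numbers)
results = {}
for stmt in statements:
    # Best of 3 runs, converted to microseconds per call.
    best = min(timeit.Timer(stmt, setup).repeat(3, number))
    results[stmt] = best / number * 1e6
    print(f"{stmt:<28} {results[stmt]:8.3f} us per call")
```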
小智 6
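I ran into the same problem. You can use at to improve it.
"Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures."
See the official reference http://pandas.pydata.org/pandas-docs/stable/indexing.html, section "Fast scalar value getting and setting".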
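A small sketch of what scalar getting and setting with at and iat looks like in practice. The row labels a, b, c and column labels x, y, z are made up for this example: at takes labels, iat takes integer positions.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with string labels so labels and positions differ.
df = pd.DataFrame(
    np.zeros(shape=[3, 3]),
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)

df.at["b", "y"] = 1.5  # set a scalar by row/column label
df.iat[2, 0] = 2.5     # set a scalar by position (row 2 is "c", column 0 is "x")

by_label = df.at["b", "y"]  # get by label
by_position = df.iat[1, 1]  # get the same cell by position

print(by_label, by_position, df.at["c", "x"])
```

Both accessors handle exactly one cell at a time; for slices or multiple rows you still need loc or iloc.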