Bru*_*cci 19 python performance pandas
While experimenting with various types of lookups on a Pandas (0.17.1) DataFrame, I am left with a few questions.
Here is the setup...
import pandas as pd
import numpy as np
import itertools
letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]
df1 = pd.DataFrame({
'value': np.random.normal(size=(1000000)),
'letter': np.random.choice(letter_combinations, 1000000)
})
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()
So df1 looks something like this...
print(df1.head(5))
>>>
letter value
0 bdh 0.253778
1 cem -1.915726
2 mru -0.434007
3 lnw -1.286693
4 fjv 0.245523
Here is the code to test the differences in lookup performance...
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']
print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']
print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']
The results...
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop
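Questions...
(1) It's pretty clear why the lookups on the sorted index are so much faster: binary search gives O(log(n)) performance versus O(n) for a full array scan. But why is the lookup on the sorted, non-indexed df2 column SLOWER than the lookup on the unsorted, non-indexed df1 column?
(2) What is up with the message "The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached."? Surely the results aren't being cached. Is it because the created index is lazy and isn't actually reindexed until needed? That would explain why it only shows up on the first call to .loc[].
(3) Why isn't an index sorted by default? Could the fixed cost of the sort be too much?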
unu*_*tbu 11
The disparity in these %timeit results
In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop
In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop
also shows up in the pure NumPy equality comparisons:
In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop
In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop
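Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython function which loops through the values of the underlying NumPy array, df1['letter'].values. It is essentially doing the same thing as df1['letter'].values == 'ben' but with different handling of NaNs.
Moreover, notice that simply accessing the items of df1['letter'] in sequential order can be done more quickly than doing the same for df2['letter']: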
In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop
In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop
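The difference in times within each of these three sets of %timeit tests is roughly the same. I think that is because they all share the same cause.
Since the letter column holds strings, the NumPy arrays df1['letter'].values and df2['letter'].values have dtype object, and therefore they hold pointers to the memory locations of arbitrary Python objects (in this case strings).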
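To make the "pointers to Python objects" point concrete, here is a minimal sketch on a tiny hypothetical array (not the 1,000,000-row column above). It only shows that an object array stores references to the existing str objects, while a float array stores the numbers themselves.

```python
import numpy as np

# Small stand-ins for the 'letter' and 'value' columns (hypothetical data).
words = ['ben', 'amy', 'abe']
letters = np.array(words, dtype=object)
values = np.array([0.25, -1.91, -0.43])

print(letters.dtype)   # object
print(values.dtype)    # float64

# Each slot of the object array refers to the very same str object
# that lives in the original list; no string data is copied.
same_objects = all(a is b for a, b in zip(letters, words))
print(same_objects)    # True

# The array buffer itself only holds one pointer per element.
print(letters.itemsize)
```

So any operation that needs the string contents has to follow each pointer to wherever that string happens to live.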
Consider the memory locations of the strings stored in the DataFrames df1 and df2. In CPython, id returns the memory location of the object:
memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
'df2': list(map(id, df2['letter'])), })
df1 df2
0 140226328244040 140226299303840
1 140226328243088 140226308389048
2 140226328243872 140226317328936
3 140226328243760 140226230086600
4 140226328243368 140226285885624
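The strings in df1 (after the first dozen or so) tend to appear sequentially in memory, while sorting causes the strings in df2 (taken in order) to be scattered in memory: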
In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]:
df1 df2
0 NaN NaN
1 -952.0 9085208.0
2 784.0 8939888.0
3 -112.0 -87242336.0
4 -392.0 55799024.0
5 -392.0 5436736.0
6 952.0 22687184.0
7 56.0 -26436984.0
8 -448.0 24264592.0
9 -56.0 -4092072.0
10 -168.0 -10421232.0
11 -363584.0 5512088.0
12 56.0 -17433416.0
13 56.0 40042552.0
14 56.0 -18859440.0
15 56.0 -76535224.0
16 56.0 94092360.0
17 56.0 -4189368.0
18 56.0 73840.0
19 56.0 -5807616.0
20 56.0 -9211680.0
21 56.0 20571736.0
22 56.0 -27142288.0
23 56.0 5615112.0
24 56.0 -5616568.0
25 56.0 5743152.0
26 56.0 -73057432.0
27 56.0 -4988200.0
28 56.0 85630584.0
29 56.0 -4706136.0
Most of the strings in df1 are 56 bytes apart:
In [16]: diffs['df1'].value_counts()
Out[16]:
56.0 986109
120.0 13671
-524168.0 215
-56.0 1
-12664712.0 1
41136.0 1
-231731080.0 1
Name: df1, dtype: int64
In [20]: len(diffs['df1'].value_counts())
Out[20]: 7
In contrast, the strings in df2 are scattered all over the place:
In [17]: diffs['df2'].value_counts().head()
Out[17]:
-56.0 46
56.0 44
168.0 39
-112.0 37
-392.0 35
Name: df2, dtype: int64
In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764
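When these objects (strings) are located sequentially in memory, their values can be retrieved more quickly. This is why the equality comparisons performed by df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values == 'ben'. The lookup time is smaller.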
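The effect can be approximated without pandas at all. The following is only a rough sketch under a few assumptions: the strings are hypothetical, it relies on CPython tending to allocate consecutively created objects at nearby addresses (an implementation detail, not a guarantee), and the size of any gap depends on the machine and its caches, so no particular timings are claimed.

```python
import random
import timeit

n = 200_000

# Build distinct strings one after another, so CPython tends to place
# them at nearby addresses (implementation detail, not a guarantee).
in_order = [str(i).zfill(8) for i in range(n)]

# The same string objects, visited in a shuffled order. This mimics what
# sorting by an unrelated key does to the memory access pattern.
scattered = in_order[:]
random.Random(0).shuffle(scattered)

def count_matches(strings, target):
    # Every comparison has to follow the pointer to a str object.
    return sum(1 for s in strings if s == target)

target = in_order[n // 2]
t_in_order = min(timeit.repeat(lambda: count_matches(in_order, target),
                               number=3, repeat=3))
t_scattered = min(timeit.repeat(lambda: count_matches(scattered, target),
                                number=3, repeat=3))

print('allocation order:', t_in_order)
print('shuffled order:  ', t_scattered)
```

Both lists contain exactly the same objects and do exactly the same comparisons; only the order in which memory is touched differs.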
This memory access issue also explains why there is no disparity in the %timeit results for the value column.
In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop
In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop
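df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object arrays, their values are packed together contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter') causes the values in df2['value'] to be reordered, but since the values are copied into a new NumPy array, they are again located sequentially in memory. So accessing the values in df2['value'] can be done just as quickly as accessing those in df1['value'].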
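A small NumPy-only sketch of that last point. It uses fancy indexing with an argsort order as a stand-in for what reordering a float column amounts to (that pandas does something equivalent internally is my assumption here), and the array names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=10)          # float64 payload, like the 'value' column
keys = rng.integers(0, 5, size=10)    # hypothetical stand-in for the sort key

order = np.argsort(keys, kind='stable')
reordered = values[order]             # fancy indexing copies into a new array

print(reordered.dtype)                      # float64
print(reordered.flags['C_CONTIGUOUS'])      # True
print(reordered.strides)                    # (8,)
print(np.shares_memory(values, reordered))  # False
```

The reordered floats sit 8 bytes apart in a fresh buffer, so scanning them is just as cache-friendly as scanning the original array.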
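(1) pandas currently has no knowledge of the sortedness of a column.
If you want to take advantage of sorted data, you could use df2.letter.searchsorted. See @unutbu's answer for an explanation of what is actually causing the difference in time here.
(2) The hash table that sits underneath the index is lazily created, then cached.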
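For point (1), here is a rough sketch of how searchsorted could be used on a column that is already sorted. The frame is a tiny hypothetical one and lookup_sorted is a made-up helper name; it calls np.searchsorted on the column's array, which is the NumPy function behind the same idea as df2.letter.searchsorted.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; df2 plays the role of the letter-sorted DataFrame.
df1 = pd.DataFrame({
    'letter': ['ben', 'amy', 'abe', 'ben', 'cat', 'amy', 'ben'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
df2 = df1.sort_values('letter').reset_index(drop=True)

def lookup_sorted(df, key, column='letter'):
    """Rows whose `column` equals `key`, assuming `df` is sorted by `column`."""
    col = df[column].to_numpy()
    # Binary search for the first and the one-past-last position of `key`.
    lo = np.searchsorted(col, key, side='left')
    hi = np.searchsorted(col, key, side='right')
    return df.iloc[lo:hi]

result = lookup_sorted(df2, 'ben')
print(result)
```

This only gives correct results if the column really is sorted, since pandas will not check that for you. A key that is absent simply yields an empty slice.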