Index lookup on a pandas DataFrame. Why is it so slow? How can it be sped up?

Question

use*_*956 5 python indexing performance pandas

Suppose I have a pandas Series that I want to use as a multimap (multiple values per index key):

import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))

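I want to select (as quickly as possible) all the values from a whose index matches another index, b. (Like an inner join. Or a merge, but for Series.)

a may have duplicates in its index.
b has no duplicates, and it is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted Index object: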

b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()

So we would have something like:

                      target  
   a        b         result
3  0        3      3  0
3  1        7      8  3 
4  2        8      ...     
8  3      ...
9  4
...

I am also only interested in the values of the result (the index [3,8,...] is not needed).

If a had no duplicates, we would simply do:

a.reindex(b)  # Cannot reindex a duplicate axis

Because & keeps the duplicates of a, we can't do this:

d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)

So I think we need to do something like:

common = (a.index & b).unique()
a.loc[common]

...which is cumbersome, but also surprisingly slow. Building the list of items to select is not the slow part:

%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop

...so it looks like it is really the retrieval of the values that is slow:

common = ((a.index & b).unique()).sort_values()

%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop

%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop

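...roughly 20 operations per second. Not exactly zippy! Why is it so slow?

Surely there must be a fast way to look up a set of values from a pandas DataFrame? I don't need an indexed object back. All I'm really asking for is a merge over sorted indexes, or a (slower) hashed int lookup. Either way, this should be an extremely fast operation, not a 20-per-second operation on my 3 GHz machine.

Also:

Profiling a.loc[common] gives: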

ncalls  tottime  percall  cumtime   percall filename:lineno(function)
# All the time spent here.
40      1.01     0.02525  1.018     0.02546 ~:0(<method 'get_indexer_non_unique' indexing.py:1443(_has_valid_type)
...
# seems to be called a lot.
1500    0.000582 3.88e-07 0.000832  5.547e-07 ~:0(<isinstance>)

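P.S. I previously posted a similar question about why Series.map is so slow: "Why is pandas.series.map so slow?". The cause there was lazy under-the-hood indexing. That doesn't seem to be what is happening here.

Update:

For a similarly sized a and common, where a's index is unique:

%timeit a.loc[common]
# 1000 loops, best of 3: 760 µs per loop

...as @jpp pointed out. The multi-index (an index with repeated keys) is probably the culprit.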

Answer 1

jpp*_*jpp 4

Duplicate index keys will definitely slow down your DataFrame indexing operations. You can modify your inputs to prove this to yourself:

a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common]  # 34.1 ms

a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common]  # 6.86 ms

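The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The names a_dup, a_unique and best_ms are made up for this illustration, and common is rebuilt with Index.intersection (my assumption, since & on indexes is no longer a set operation in pandas 2.x). Absolute timings will vary with the machine and the pandas version, so treat the printed numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
n = 100_000

a_dup = pd.Series(-np.arange(n), index=rng.integers(0, 50_000, n))  # repeated keys
a_unique = pd.Series(-np.arange(n), index=np.arange(n))             # unique keys
b = pd.Index(np.unique(rng.integers(30_000, 100_000, n)))

# Keys present in both a_dup's index and b, each listed once.
common = a_dup.index.unique().intersection(b)


def best_ms(series, repeat=5):
    """Best wall-clock time of series.loc[common] over `repeat` runs, in ms."""
    runs = timeit.repeat(lambda: series.loc[common], number=1, repeat=repeat)
    return min(runs) * 1e3


t_dup = best_ms(a_dup)
t_unique = best_ms(a_unique)
print(f"duplicated index: {t_dup:.2f} ms")
print(f"unique index:     {t_unique:.2f} ms")
```

Both series are queried with the same keys, so any difference in the two timings comes from the index layout rather than from the number of labels requested.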

As mentioned in this related question:

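When the index is unique, pandas uses a hash table to map keys to values, O(1). When the index is non-unique and sorted, pandas uses binary search, O(log N). When the index is in random order, pandas has to check every key in the index, O(N).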
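
One way to act on that description, sketched under assumptions: first check which of the three cases the index falls into, then either sort the index once or skip label lookup altogether with a boolean mask. The Index.isin mask is a technique I am swapping in here and is not part of the quoted answer. Whether either option is actually faster for list-like lookups depends on the pandas version, so it is worth timing both on your own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
a = pd.Series(-np.arange(100_000), index=rng.integers(0, 50_000, 100_000))
b = pd.Index(np.unique(rng.integers(30_000, 100_000, 100_000)))

# Which of the three cases from the quote applies to this index?
print(a.index.is_unique)                # False: keys repeat
print(a.index.is_monotonic_increasing)  # False: keys are in random order

# Option 1: sort once, so the index is non-unique but sorted.
a_sorted = a.sort_index()
common = a_sorted.index.unique().intersection(b)
by_label = a_sorted.loc[common]

# Option 2: boolean mask instead of label lookup; keeps every duplicate row of a.
by_mask = a[a.index.isin(b)]

# Both select the same rows of a; only the row order may differ.
same_rows = np.array_equal(np.sort(by_label.to_numpy()), np.sort(by_mask.to_numpy()))
print(same_rows)

values = by_mask.to_numpy()  # values only, index dropped
```

The mask version also covers the "values only" requirement from the question directly, since it never builds the intermediate list of common keys.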
