如何加快 pandas groupby bin 的聚合速度？

Question

如何加快 pandas groupby bin 的聚合速度？

Xin*_*ang 8 python numpy scipy pandas scipy.stats

我为每一列创建了不同的 bin，并根据这些对 DataFrame 进行分组。

\n

import pandas as pd\nimport numpy as np\n\nnp.random.seed(100)\ndf = pd.DataFrame(np.random.randn(100, 4), columns=[\'a\', \'b\', \'c\', \'value\'])\n\n# for simplicity, I use the same bin here\nbins = np.arange(-3, 4, 0.05)\n\ndf[\'a_bins\'] = pd.cut(df[\'a\'], bins=bins)\ndf[\'b_bins\'] = pd.cut(df[\'b\'], bins=bins)\ndf[\'c_bins\'] = pd.cut(df[\'c\'], bins=bins)\n

Run Code Online (Sandbox Code Playgroud)\n

的输出df.groupby([\'a_bins\',\'b_bins\',\'c_bins\']).size() 表明组长度为2685619。

\n

计算各组的统计数据

\n

然后，计算各组的统计数据如下：

\n

%%timeit\ndf.groupby([\'a_bins\',\'b_bins\',\'c_bins\']).agg({\'value\':[\'mean\']})\n\n>>> 16.9 s \xc2\xb1 637 ms per loop (mean \xc2\xb1 std. dev. of 7 runs, 1 loop each)\n

Run Code Online (Sandbox Code Playgroud)\n

预期产出

\n

有可能加快这个速度吗？
更快的方法还应该支持通过输入a, b, and c值来查找值，如下所示：

\n

df.groupby([\'a_bins\',\'b_bins\',\'c_bins\']).agg({\'value\':[\'mean\']}).loc[(-1.72, 0.32, 1.18)]\n\n>>> -0.252436\n

Run Code Online (Sandbox Code Playgroud)\n

Answer 1

sam*_*mmy 6

对于此数据，我建议您对数据进行透视，并传递平均值。通常，这会更快，因为您要访问整个数据框，而不是遍历每个组：

(df
 .pivot(None, ['a_bins', 'b_bins', 'c_bins'], 'value')
 .mean()
 .sort_index() # ignore this if you are not fuzzy on order
)

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432

Run Code Online (Sandbox Code Playgroud)

这与 groupby 的输出匹配：

(df
 .groupby(['a_bins','b_bins','c_bins'])
 .agg({'value':['mean']})
 .dropna()
 .squeeze()
)

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432
Name: (value, mean), Length: 100, dtype: float64

Run Code Online (Sandbox Code Playgroud)

在我的电脑上，pivot 选项的速度为 3.72ms，而我不得不终止 groupby 选项，因为它花费的时间太长（我的电脑很旧:)）

同样，它之所以有效/更快的原因是因为平均值触及整个数据帧，而不是通过 groupby 中的组。

至于你的另一个问题，你可以轻松索引它：


bin_mean = (df
 .pivot(None, ['a_bins', 'b_bins', 'c_bins'], 'value')
 .mean()
 .sort_index() # ignore this if you are not fuzzy on order
)

bin_mean.loc[(-1.72, 0.32, 1.18)]
 -0.25243603652138985

Run Code Online (Sandbox Code Playgroud)

但主要问题是用于分类的 Pandas 将返回所有行（这是浪费且效率低下）；通过observed = True，您应该会注意到显着的改进：

(df.groupby(['a_bins','b_bins','c_bins'], observed=True)
   .agg({'value':['mean']})
)

                                              value
                                               mean
a_bins        b_bins        c_bins                 
(-2.15, -2.1] (0.25, 0.3]   (-1.3, -1.25]  0.929100
              (0.75, 0.8]   (-0.3, -0.25]  0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35]   -1.684900
              (0.75, 0.8]   (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
...                                             ...
(1.7, 1.75]   (-0.75, -0.7] (1.05, 1.1]   -0.229518
(1.85, 1.9]   (-0.4, -0.35] (1.8, 1.85]    0.003017
(1.9, 1.95]   (-1.45, -1.4] (0.1, 0.15]    0.949361
(2.05, 2.1]   (-0.35, -0.3] (-0.65, -0.6]  0.763184
(2.25, 2.3]   (-0.95, -0.9] (0.1, 0.15]    2.539432

Run Code Online (Sandbox Code Playgroud)

在我的 PC 上，速度约为 7.39 毫秒，大约比枢轴选项少 2 倍，但现在速度更快，这是因为仅使用/返回数据框中存在的分类。

归档时间：	3 年，11 月前
查看次数：	1782 次
最近记录：	3 年，11 月前