How to make pandas groupby not lazy?

Question

use*_*964 5 python group-by pandas

This tutorial mentions that a pandas groupby object is lazy.

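"It's lazy in nature. It doesn't really do any operations to produce a useful result until you tell it to."

and

"It's also worth mentioning that .groupby() does do some, but not all, of the splitting work by building a Grouping class instance for each key that you pass. However, many of the methods of the BaseGrouper class that holds these groupings are called lazily rather than at __init__(), and many also use a cached property design."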

So I ran some tests to check that groupby really is lazy.

Let

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randint(1, 10, size=(1000000, 4)))

Then

    %timeit gg = df.groupby(1)
    35.6 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This takes almost no time. And

    gg = df.groupby(1)  # %timeit does not keep the assignment made inside it
    %timeit res = gg.get_group(1)
    2.76 ms ± 8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

takes time on the same order as

    %timeit res = df[df[1] == 1]
    6.87 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On the other hand, if we extract the groups first

    %timeit gdict = df.groupby(1).groups
    15.7 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

then getting a group takes almost no time

    gdict = df.groupby(1).groups  # %timeit does not keep the assignment made inside it
    %timeit gdict[1]
    29.8 ns ± 0.0989 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

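So my questions are

Why is pandas groupby designed to be so lazy? In practice, I think I almost always need to run many further operations on the groupby object. If the groupby object puts off splitting the dataframe in the first place, then time is wasted every time I run an operation such as get_group.
I also don't understand the statement ".groupby() does do some, but not all, of the splitting work by building a Grouping class instance for each key that you pass". What does it mean?
Is it possible to make the groupby object not lazy?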

Answer 1

Dim*_*try 1

You need a bigger benchmark:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randint(1, 10, size=(100000000, 4)))  # 3GB data
    gg = df.groupby(1)
    %time _ = gg.get_group(1)    # first call is slow
    %time _ = gg.get_group(1)    # fast
    %time _ = gg.get_group(2)    # lookup of another group is also fast
    %timeit _ = gg.get_group(1)  # misleading: the repeated runs only hit the warm cache

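Groupby is lazy because it does not compute groups immediately. It does so on the first request for them, or when you are in IPython and press Tab with the cursor on gg. You can see this by tracking the memory consumption of the process, or you can simply feel the delay in the IPython case.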
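To illustrate the "compute on first request, then reuse" behaviour described above, here is a minimal sketch built on `functools.cached_property`. The `LazyGrouper` class and its `computations` counter are hypothetical names made up for this example; this is a toy stand-in, not how pandas actually implements its groupby internals.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class LazyGrouper:
    """Toy stand-in for a groupby object (hypothetical, not pandas' implementation)."""

    def __init__(self, frame, key):
        # Construction only stores references, so it is cheap.
        self.frame = frame
        self.key = key
        self.computations = 0

    @cached_property
    def groups(self):
        # The expensive split runs on first access; the result is then cached.
        self.computations += 1
        labels = self.frame[self.key]
        return {
            value: labels.index[(labels == value).to_numpy()]
            for value in labels.unique()
        }


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(1000, 4)))

lazy = LazyGrouper(df, 1)
print(lazy.computations)  # no split has been computed yet
_ = lazy.groups           # first access does the work
_ = lazy.groups           # second access reuses the cached dict
print(lazy.computations)
```

The counter only increases on the first access, which mirrors the slow first call and fast later calls seen in the benchmark.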

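It is hard to guess what is going on under the hood, but get_group seems to have its own cache, while groups and methods such as sum or min share another one. This may be an attempt to minimize memory usage across different use cases. Either way, the laziness is gone after the first use.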
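If the goal is to pay the cost up front rather than on first use, one possible approach is sketched below. It is an assumption-laden illustration rather than a pandas switch for "eager mode": accessing `.groups` forces the label lookup to be built, and iterating over the groupby object materializes every sub-frame into a plain dict (the name `eager` is made up here). Whether touching `.groups` also warms the caches used by other methods is not something established above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(1000, 4)))

gg = df.groupby(1)

# Option A: touch .groups once so the key -> row-label mapping is built now.
_ = gg.groups

# Option B: materialize every sub-frame eagerly (this costs extra memory,
# roughly a second copy of the data).
eager = {key: frame for key, frame in gg}

print(sorted(eager))
print(eager[1].equals(gg.get_group(1)))
```

Option B trades memory for lookup speed, so it only makes sense when the same groups are accessed many times.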

Your last test is wrong. gg.groups contains the index labels, not the groups themselves:

    %timeit df.loc[gdict[1]]  # it is actually the slowest
    1.23 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df[df[1] == 1]
    928 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit gg.get_group(1)
    510 ms ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Retrieving an item from a dictionary is indeed thousands of times faster, but you are trading space for speed.
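A small check of the point that `groups` holds row labels rather than data, on a frame small enough to run quickly. The variable names `labels`, `via_labels` and `via_get_group` are only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(1000, 4)))

gg = df.groupby(1)
gdict = gg.groups

labels = gdict[1]            # an Index of row labels, not a DataFrame
via_labels = df.loc[labels]  # the extra lookup that the dict timing left out
via_get_group = gg.get_group(1)

print(type(labels).__name__)
print(via_labels.shape, via_get_group.shape)
```

Timing `gdict[1]` alone therefore measures only a dictionary lookup; the actual rows still have to be fetched with `df.loc`.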
If you are absolutely sure that you need to run functions on the same groups many times, you can try sorting the dataframe on the column and saving the group slices.

    %time df = df.sort_values(1, ignore_index=True)
    # Wall time: 10.3 s
    %time ids = df[1].diff().to_numpy().nonzero()[0]
    # Wall time: 1.88 s
    %time gl = {df[1][v]: slice(v, ids[i+1] if (i+1) < len(ids) else None) for i, v in enumerate(ids)}
    # Wall time: 112 µs
    %timeit df[gl[1]]
    # 12.1 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For some use cases, working on the sorted data may be the fastest option.
    %timeit {k: df[v].sum() for k, v in gl.items()}
    1.16 s ± 42.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit gg.sum()
    2.73 s ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit {x: gg.get_group(x).sum() for x in range(1, 10)}
    4.23 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
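The sort-and-slice idea can be checked on a small frame without IPython magics. This sketch is a variation on the snippet above, not a copy of it: it finds the block boundaries with `np.flatnonzero` on the key array instead of `diff()`, and the names `starts`, `stops`, `slice_sums` and `groupby_sums` are introduced here for illustration. It makes no claim about speed, only that the slices cover the same rows that groupby would produce.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(1000, 4)))

# Sort by the key column so every group occupies one contiguous block of rows.
df = df.sort_values(1, ignore_index=True)

keys = df[1].to_numpy()
# Positions where the key changes; position 0 always starts the first block.
starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
stops = np.r_[starts[1:], len(keys)]
gl = {int(keys[a]): slice(int(a), int(b)) for a, b in zip(starts, stops)}

# Per-group sums via positional slices, compared against a regular groupby.
slice_sums = pd.DataFrame({k: df.iloc[s].sum() for k, s in gl.items()}).T
groupby_sums = df.groupby(1).sum()

print(slice_sums[[0, 2, 3]].head())
print(groupby_sums.head())
```

Note that the slices are only valid for the sorted frame; any reordering or filtering of `df` afterwards invalidates `gl`.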

Archived:	6 years, 2 months ago
Views:	963
Last active:	4 years, 8 months ago