Pandas DataFrame groupby: aggregate functions and the difference between a column's max and min, computed on the fly

Question

bur*_*cak 5 aggregate-functions dataframe pandas pandas-groupby

import pandas as pd

df = {'a': ['xxx', 'xxx','xxx','yyy','yyy','yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000] }
df = pd.DataFrame(data=df)


df_new = df.groupby("a",as_index=True).agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            StartMin=pd.NamedAgg(column='start', aggfunc="min"),
            StartMax=pd.NamedAgg(column='start', aggfunc="max"),
            )

This gives:

>>>df_new
     ProcessiveGroupLength  StartMin  StartMax
a
xxx                      3     10000     11000
yyy                      3     12000     14000

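How can I get directly to the result below, computing the difference inside the aggregation? I assume that would be faster.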

>>>df_new
     ProcessiveGroupLength    Diff
a
xxx                      3      1000
yyy                      3      2000

The code below raises the following error:

Traceback (most recent call last):
  File "", line 5
TypeError: unsupported operand type(s) for -: 'str' and 'str'

df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),                
            Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)

Answer 1

jez*_*ael 7

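Your attempt can be fixed by passing a lambda function, but I think that with many groups and/or a large DataFrame it will be slower than your first solution.

The reason is that the built-in max and min aggregations are optimized, and subtracting the two resulting Series is vectorized. In other words, aggregation is faster when no lambda function is involved.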

df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
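
As a note on why the attempt in the question fails: Python evaluates the expression "max" - "min" before pandas is ever called, and subtracting two strings raises the TypeError shown there. aggfunc expects either a single aggregation name or a callable that receives each group's values as a Series. A minimal sketch on the question's sample data, where value_range is a hypothetical helper name and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["xxx", "xxx", "xxx", "yyy", "yyy", "yyy"],
    "start": [10000, 10500, 11000, 12000, 13000, 14000],
})

# Subtracting two strings fails in plain Python, before pandas is involved.
try:
    "max" - "min"
    string_subtraction_failed = False
except TypeError:
    string_subtraction_failed = True


def value_range(group: pd.Series):
    """Hypothetical helper: largest minus smallest value of one group."""
    return group.max() - group.min()


# A named function is accepted by aggfunc just like the lambda.
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column="start", aggfunc="count"),
    Diff=pd.NamedAgg(column="start", aggfunc=value_range),
)
print(df_new)
```

A named function behaves like the lambda here; it is still called once per group in Python, so the performance caveat above applies to it as well.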

Or you can use numpy.ptp:

import numpy as np

# np.ptp ("peak to peak") returns max - min of the values it is given
df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)

print (df_new)
     ProcessiveGroupLength  Diff
a                               
xxx                      3  1000
yyy                      3  2000
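
The faster route described earlier (built-in min and max, followed by one vectorized subtraction) can be sketched on the question's sample data as follows. The intermediate column names StartMin and StartMax are taken from the question; pop removes them from the frame while the difference is computed, so only Diff remains.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["xxx", "xxx", "xxx", "yyy", "yyy", "yyy"],
    "start": [10000, 10500, 11000, 12000, 13000, 14000],
})

df_new = (
    df.groupby("a")
      .agg(
          ProcessiveGroupLength=pd.NamedAgg(column="start", aggfunc="count"),
          StartMin=pd.NamedAgg(column="start", aggfunc="min"),
          StartMax=pd.NamedAgg(column="start", aggfunc="max"),
      )
      # Subtract the two aggregated columns in one vectorized step and
      # drop them from the result at the same time.
      .assign(Diff=lambda x: x.pop("StartMax") - x.pop("StartMin"))
)
print(df_new)
```

The subtraction runs once over the aggregated columns rather than once per group, which is where the speed difference in the timings comes from.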

Performance: this depends on the data; here 1k groups in 1M rows are used:

np.random.seed(20)

N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start':np.random.randint(10000, size=N)})
print (df)

In [229]: %%timeit
     ...: df_new = df.groupby("a",as_index=True).agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             StartMin=pd.NamedAgg(column='start', aggfunc="min"),
     ...:             StartMax=pd.NamedAgg(column='start', aggfunc="max"),
     ...:             ).assign(Diff = lambda x: x.pop('StartMax') - x.pop('StartMin'))
     ...:             
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
     ...:             
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
     ...:             
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
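
Outside IPython, where %%timeit is not available, a rough comparison can be made with the standard timeit module. This is only a sketch: the helper names with_builtins and with_lambda are hypothetical, N is reduced to keep the run short, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(20)
N = 100_000  # assumption: smaller than the 1M rows above, to keep the run short
df = pd.DataFrame({
    "a": np.random.randint(1000, size=N),
    "start": np.random.randint(10000, size=N),
})


def with_builtins(frame):
    """Built-in min/max aggregations followed by a vectorized subtraction."""
    return frame.groupby("a").agg(
        ProcessiveGroupLength=pd.NamedAgg(column="start", aggfunc="count"),
        StartMin=pd.NamedAgg(column="start", aggfunc="min"),
        StartMax=pd.NamedAgg(column="start", aggfunc="max"),
    ).assign(Diff=lambda x: x.pop("StartMax") - x.pop("StartMin"))


def with_lambda(frame):
    """Python callable evaluated once per group."""
    return frame.groupby("a").agg(
        ProcessiveGroupLength=pd.NamedAgg(column="start", aggfunc="count"),
        Diff=pd.NamedAgg(column="start", aggfunc=lambda x: x.max() - x.min()),
    )


# Check that both variants agree before comparing their speed.
left, right = with_builtins(df), with_lambda(df)
same_result = bool((left["Diff"].to_numpy() == right["Diff"].to_numpy()).all())

runs = 5
t_builtins = timeit.timeit(lambda: with_builtins(df), number=runs) / runs
t_lambda = timeit.timeit(lambda: with_lambda(df), number=runs) / runs

print(f"same result: {same_result}")
print(f"built-ins: {t_builtins * 1e3:.1f} ms per call")
print(f"lambda:    {t_lambda * 1e3:.1f} ms per call")
```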

Asked:	5 years, 5 months ago
Viewed:	504 times
Last active:	5 years, 5 months ago