bur*_*cak 5 aggregate-functions dataframe pandas pandas-groupby
import pandas as pd
df = {'a': ['xxx', 'xxx', 'xxx', 'yyy', 'yyy', 'yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000]}
df = pd.DataFrame(data=df)
df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
)
gives
>>> df_new
     ProcessiveGroupLength  StartMin  StartMax
a
xxx                      3     10000     11000
yyy                      3     12000     14000
How can I get the result below directly in one step, since I think that would be faster?
>>> df_new
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
The code below gives the following error message:
Traceback (most recent call last): File "", line 5, TypeError: unsupported operand type(s) for -: 'str' and 'str'
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)
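Your solution can be fixed by using a lambda function, but I think that with many groups and/or a large DataFrame it will be slower than the first solution.
The reason is that the built-in max and min functions are optimized and the subtraction of the resulting Series is vectorized. In other words, aggregation is faster when no lambda function is used.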
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
Or you can use numpy.ptp:
import numpy as np
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
print (df_new)
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
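For comparison, the faster approach described above (built-in aggregations only, followed by a vectorized subtraction) could look like the sketch below on the question's sample data. The name df_fast is only illustrative; the helper columns StartMin and StartMax are dropped with pop while Diff is computed.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['xxx', 'xxx', 'xxx', 'yyy', 'yyy', 'yyy'],
    'start': [10000, 10500, 11000, 12000, 13000, 14000],
})

# Aggregate with the built-in (optimized) "count", "min" and "max" only.
df_fast = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
)

# Vectorized subtraction of two columns; pop() removes the helper
# columns from the frame as they are read, so only Diff remains.
df_fast = df_fast.assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))
print(df_fast)
```

This does one extra step after the aggregation, but the subtraction runs once over whole columns instead of once per group.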
Performance: it depends on the data; here 1k groups in 1M rows are used:
np.random.seed(20)
N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start': np.random.randint(10000, size=N)})
print (df)
In [229]: %%timeit
...: df_new = df.groupby("a",as_index=True).agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: StartMin=pd.NamedAgg(column='start', aggfunc="min"),
...: StartMax=pd.NamedAgg(column='start', aggfunc="max"),
...: ).assign(Diff = lambda x: x.pop('StartMax') - x.pop('StartMin'))
...:
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [230]: %%timeit
...: df_new = df.groupby("a").agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
...:
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %%timeit
...: df_new = df.groupby("a").agg(
...: ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
...: Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
...:
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
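If you want to repeat the comparison outside IPython, the sketch below uses the standard timeit module instead of %%timeit. The function names builtin_then_subtract and lambda_agg are made up for this example, and N is reduced to 100k rows (an assumption, just to keep the run short), so the absolute numbers will differ from the timings above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(20)
N = 100_000  # smaller than the 1M rows used above, so the sketch runs quickly
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start': np.random.randint(10000, size=N)})


def builtin_then_subtract(frame):
    # Built-in aggregations first, then one vectorized subtraction.
    out = frame.groupby("a").agg(
        ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
        StartMin=pd.NamedAgg(column='start', aggfunc="min"),
        StartMax=pd.NamedAgg(column='start', aggfunc="max"),
    )
    return out.assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))


def lambda_agg(frame):
    # The lambda is called once per group.
    return frame.groupby("a").agg(
        ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
        Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
    )


# Check that both variants produce the same Diff values per group.
same = bool((builtin_then_subtract(df)['Diff'] == lambda_agg(df)['Diff']).all())
print("same Diff values:", same)

for func in (builtin_then_subtract, lambda_agg):
    # Best of 3 repeats, 3 calls per repeat, reported per call.
    best = min(timeit.repeat(lambda: func(df), number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")
```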