What Pandas data type is passed to transform or apply in a groupby

8on*_*ne6 6 python pandas

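While trying to debug a function applied through groupby, someone suggested I use a dummy function to "see what is being passed" into the function for each group. Of course, I'm game: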

import numpy as np
import pandas as pd

np.random.seed(0) # so we can all play along at home

categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))

df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

def f(x):
    # Dummy function: report the type of whatever groupby passes in.
    print(type(x))
    return x

print('single column transform')
df.groupby(['category'])['data_1'].transform(f)
print('\n')

print('single column (nested) transform')
df.groupby(['category'])[['data_1']].transform(f)
print('\n')

print('multiple column transform')
df.groupby(['category'])[['data_1', 'data_2']].transform(f)

print('\n')
print('\n')

print('single column apply')
df.groupby(['category'])['data_1'].apply(f)
print('\n')

print('single column (nested) apply')
df.groupby(['category'])[['data_1']].apply(f)
print('\n')

print('multiple column apply')
df.groupby(['category'])[['data_1', 'data_2']].apply(f)

This puts the following on my stdout:

single column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


single column (nested) transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>




single column apply
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


single column (nested) apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


multiple column apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

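So it looks like:

  • transform
    • single column: 3 Series
    • single column (nested): 2 Series and 3 DataFrames
    • multiple column: 3 Series and 3 DataFrames
  • apply
    • single column: 3 Series
    • single column (nested): 4 DataFrames
    • multiple column: 4 DataFrames

What's going on here? Can anyone explain why each of these 6 calls results in the sequence of objects listed above being passed to the given function?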

HYR*_*YRY 4

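GroupBy.transform tries both a fast_path and a slow_path for your function.

  • fast_path: call your function with the DataFrame object
  • slow_path: call your function through DataFrame.apply, so it receives one Series per column

When the result of fast_path is the same as the result of slow_path, fast_path is selected.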
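
The selection can be illustrated with a small sketch. This is not the pandas source, only a simplified imitation of the logic described above; `choose_path` is a hypothetical helper name, and the comparison via `DataFrame.equals` is an assumption made for illustration.

```python
import pandas as pd

def choose_path(func, group):
    """Hypothetical helper: imitate the fast/slow path selection for one group."""
    # Slow path: DataFrame.apply calls func once per column, i.e. with a Series.
    slow_result = group.apply(func, axis=0)
    try:
        # Fast path: call func once with the whole DataFrame.
        fast_result = func(group)
    except Exception:
        return "slow", slow_result
    # Assumption for this sketch: the fast path is kept only if it
    # reproduces the slow path's result exactly.
    if isinstance(fast_result, pd.DataFrame) and fast_result.equals(slow_result):
        return "fast", fast_result
    return "slow", slow_result

group = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Column-wise demeaning gives the same answer either way.
path_same, _ = choose_path(lambda x: x - x.mean(), group)

# Subtracting the mean of all values differs between a Series and a DataFrame.
path_diff, _ = choose_path(lambda x: x - x.values.mean(), group)

print(path_same, path_diff)
```

Because the slow path runs first, a dummy function that prints its argument type sees Series objects before it sees any DataFrame, which matches the order of the output in the question.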

The following output means that it finally selected fast_path:

multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

Here is the link to the code:

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2277

Edit

To check the call stack:

import numpy as np
import pandas as pd

np.random.seed(0) # so we can all play along at home

categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))

df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

import traceback
import inspect
import itertools

def f(x):
    # Keep only the stack entries from the line tagged "#stop here" onward.
    stack = itertools.dropwhile(lambda line: "#stop here" not in line,
                                traceback.format_stack(inspect.currentframe().f_back))
    print("*" * 20)
    print(x)
    print(type(x))
    print()
    print("\n".join(stack))
    return x

df.groupby(['category'])[['data_1', 'data_2']].transform(f) #stop here
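
If the full stack is more than you need, a lighter variant of the dummy function collects the type names in a list so the calls can be compared programmatically. This is only a sketch: `make_recorder` is a hypothetical helper, and the exact number of calls depends on the pandas version, so no counts are asserted here.

```python
import numpy as np
import pandas as pd

def make_recorder():
    """Hypothetical helper: return a dummy function and the list it fills."""
    seen = []
    def f(x):
        # Record the class name of whatever groupby passes in.
        seen.append(type(x).__name__)
        return x
    return f, seen

np.random.seed(0)
categories = list('abc') * 4
df = pd.DataFrame({'category': categories,
                   'data_1': np.random.randn(len(categories)),
                   'data_2': np.random.randn(len(categories))})

f_single, seen_single = make_recorder()
df.groupby(['category'])['data_1'].transform(f_single)

f_multi, seen_multi = make_recorder()
df.groupby(['category'])[['data_1', 'data_2']].transform(f_multi)

print(seen_single)
print(seen_multi)
```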