Pandas agg function gives different results for numpy std vs nanstd

use*_*537 9 python pandas

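I'm converting some numpy code to use pandas DataFrame. The data may contain NaN values, so the original code uses numpy's nan functions such as nanstd. My impression is that pandas skips NaN values by default, so I switched to the regular versions of the same functions.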
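A quick way to check that impression, as a minimal sketch on made-up values (the three-element array is an assumption for illustration, not data from this question). It compares the plain and NaN-aware NumPy functions on a raw array with `Series.std()`, whose `skipna` and `ddof` defaults differ from NumPy's:

```python
import numpy as np
import pandas as pd

# Illustrative values only; one entry is missing.
values = np.array([1.0, np.nan, 3.0])
s = pd.Series(values)

plain = np.std(values)         # NaN propagates through the plain NumPy function
nan_aware = np.nanstd(values)  # NaN ignored, divisor n (ddof=0)
pandas_default = s.std()       # NaN skipped, divisor n - 1 (ddof=1)
pandas_ddof0 = s.std(ddof=0)   # NaN skipped, divisor n (ddof=0)

print(plain, nan_aware, pandas_default, pandas_ddof0)
```

So pandas does skip NaN by default, but its `std` also uses a different divisor than `np.nanstd` unless `ddof=0` is passed.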

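I want to group the data and compute some statistics using agg(), but when I use np.std() I get different results from the original code, even when the data doesn't contain any NaNs.

Here is a small example that demonstrates the problem: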

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([[1.17136, 1.11816],
                    [1.13096, 1.04134],
                    [1.13865, 1.03414],
                    [1.09053, 0.96330],
                    [1.02455, 0.94728],
                    [1.18182, 1.04950],
                    [1.09620, 1.06686]])

>>> df = pd.DataFrame(arr, 
                      index=['foo']*3 + ['bar']*4, 
                      columns=['A', 'B'])

>>> df
           A        B
foo  1.17136  1.11816
foo  1.13096  1.04134
foo  1.13865  1.03414
bar  1.09053  0.96330
bar  1.02455  0.94728
bar  1.18182  1.04950
bar  1.09620  1.06686

>>> g = df.groupby(df.index)

>>> g['A'].agg([np.mean, np.median, np.std])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

>>> g['A'].agg([np.mean, np.median, np.nanstd])
         mean    median    nanstd
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

If I compute the std values with the numpy functions directly, I get the expected result in both cases. What is happening inside the agg() function?

>>> np.std(df.loc['foo', 'A'])
0.01751583474079002
>>> np.nanstd(df.loc['foo', 'A'])
0.017515834740790021

Edit:

As mentioned in the answer linked by Vivek Harikrishnan, pandas uses a different method to compute the std. This seems to agree with my results:

>>> g['A'].agg(['mean', 'median', 'std'])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

If I specify a lambda that calls np.std(), I get the expected result:

>>> g['A'].agg([np.mean, np.median, lambda x: np.std(x)])
         mean    median  <lambda>
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

This suggests that the pandas functions are being called instead when I write g['A'].agg([np.mean, np.median, np.std]). The question is: why does this happen when I explicitly tell it to use the numpy functions?

Max*_*axU 11

It looks like Pandas either replaces np.std in the .agg([np.mean, np.median, np.std]) call with the built-in Series.std() method, or calls np.std(series, ddof=1):

In [337]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x)])
Out[337]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.055856
foo  1.146990  1.138650  0.021452  0.017516

NOTE: np.std and lambda x: np.std(x) produce different results.
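The gap between the two columns is consistent with the `ddof` ("delta degrees of freedom") divisor: NumPy's `std` defaults to `ddof=0` (divide by n), while pandas' `std` defaults to `ddof=1` (divide by n - 1). A minimal sketch that recomputes both values by hand for the 'foo' group of column A; `std_with_ddof` is a hypothetical helper written for this illustration, not a pandas or NumPy function:

```python
import numpy as np

# Column 'A' values for group 'foo', copied from the question.
foo_a = np.array([1.17136, 1.13096, 1.13865])

def std_with_ddof(values, ddof):
    """Square root of the summed squared deviations divided by (n - ddof)."""
    deviations = values - values.mean()
    return np.sqrt((deviations ** 2).sum() / (len(values) - ddof))

population_std = std_with_ddof(foo_a, ddof=0)  # matches np.std's default
sample_std = std_with_ddof(foo_a, ddof=1)      # matches pandas' default

print(population_std)  # about 0.017516, the <lambda> column
print(sample_std)      # about 0.021452, the std column
```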

If we specify ddof=1 explicitly (the Pandas default), we get the same result:

In [338]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x, ddof=1)])
Out[338]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452

Using the built-in function 'std' gives the same result:

In [341]: g['A'].agg([np.mean, np.median, 'std', lambda x: np.std(x, ddof=1)])
Out[341]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
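To avoid depending on how pandas treats a bare np.std inside agg (which may differ between pandas versions), one option is to state the intent explicitly. The sketch below rebuilds column A from the question and requests both variants side by side; `population_std` is a hypothetical helper name chosen here. Because it calls `Series.std(ddof=0)`, it skips NaN and divides by n, which is the behaviour of np.nanstd:

```python
import numpy as np
import pandas as pd

# Column 'A' from the question, with the same group labels as the index.
df = pd.DataFrame(
    {"A": [1.17136, 1.13096, 1.13865, 1.09053, 1.02455, 1.18182, 1.09620]},
    index=["foo"] * 3 + ["bar"] * 4,
)
g = df.groupby(df.index)

def population_std(x):
    """NaN-skipping standard deviation with ddof=0 (hypothetical helper)."""
    return x.std(ddof=0)

# 'std' is the pandas sample std (ddof=1); population_std uses ddof=0.
result = g["A"].agg(["std", population_std])
print(result)

# The groupby std method also accepts ddof directly.
direct = g["A"].std(ddof=0)
print(direct)
```

A named function also gives the output column a readable name instead of `<lambda>`.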

The second rule of the Zen of Python says it all:

In [340]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.  # <----------- NOTE !!!
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

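  • @user3419537 or MaxU: Should we consider this behaviour a bug? Silently changing the default behaviour of a user-specified function is exactly the kind of oversight that leads to large numbers of retracted papers across a field. I found https://github.com/pandas-dev/pandas/issues/18734, which is related but slightly different. (3 upvotes)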