Why don't pandas logical operators align on the index like they should?

Question

Consider this simple setup:

import pandas as pd
x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([2, 3, 3], index=list('bca'))

x

a    1
b    2
c    3
dtype: int64

y

b    2
c    3
a    3
dtype: int64

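As you can see, the indexes contain the same labels, just in a different order.

Now, consider a simple logical comparison using the equals (==) operator: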

x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

A ValueError is raised, most likely because the indexes do not match. On the other hand, calling the equivalent eq method works:

x.eq(y)

a    False
b     True
c     True
dtype: bool

OTOH, the operator works if y is reordered first...

x == y.reindex_like(x)

a    False
b     True
c     True
dtype: bool

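My understanding was that the method and the operator should do the same thing, all other things being equal. What is eq doing that the operator comparison is not?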

Answer 1

use*_*ica 31

Look at the whole traceback for a Series comparison with mismatched indexes, focusing in particular on the exception message:

In [1]: import pandas as pd
In [2]: x = pd.Series([1, 2, 3], index=list('abc'))
In [3]: y = pd.Series([2, 3, 3], index=list('bca'))
In [4]: x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-73b2790c1e5e> in <module>()
----> 1 x == y
/usr/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1188 
   1189         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1190             raise ValueError("Can only compare identically-labeled "
   1191                              "Series objects")
   1192 
ValueError: Can only compare identically-labeled Series objects

We see that this is a deliberate implementation decision. It is also not unique to Series objects: DataFrames raise a similar error.
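The DataFrame case can be reproduced with a small sketch. The frames `df1` and `df2` and the column name `v` are made up for illustration, and the behavior shown assumes pandas 0.19 or later.

```python
import pandas as pd

# Hypothetical frames: same row labels, different order.
df1 = pd.DataFrame({"v": [1, 2, 3]}, index=list("abc"))
df2 = pd.DataFrame({"v": [2, 3, 3]}, index=list("bca"))

# The == operator refuses to compare frames whose labels do not line up.
try:
    df1 == df2
    operator_raised = False
except ValueError:
    operator_raised = True

# The flexible method aligns on the labels first, then compares.
flexible = df1.eq(df2)

print(operator_raised)
print(flexible)
```

As with Series, the flexible `eq` method is the one to reach for when the labels match but the order does not.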

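Digging through the Git blame for the relevant code eventually turns up some related commits and issue tracker threads. For example, Series.__eq__ used to ignore the RHS's index completely, and in a comment on a bug report about that behavior, pandas author Wes McKinney says: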

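This is actually a feature / deliberate choice and not a bug; it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large number of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working.

This gets back to the issue that Series isn't quite ndarray-like enough and probably should not be a subclass of ndarray.

So, I haven't got a good answer for you beyond that. Auto-alignment would be ideal, but I don't think I can do it unless I make Series not a subclass of ndarray. I think that is probably a good idea, but it is not likely to happen until 0.9 or 0.10 (several months down the road).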
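To see why label alignment clashes with the arr[1:] == arr[:-1] idiom mentioned in the quote, here is a minimal sketch. The Series `s` is a made-up example, and `Series.eq` stands in for what an auto-aligning `==` would have done.

```python
import pandas as pd

# Hypothetical data with a default integer index (labels 0..4).
s = pd.Series([1, 1, 2, 3, 3])

# NumPy idiom: compare each element with its predecessor, by position.
arr = s.to_numpy()
positional = (arr[1:] == arr[:-1]).tolist()

# Label-aligned comparison: each label is matched with itself, so the
# shift is lost. Labels 0 and 4 have no partner and compare as False.
aligned = s.iloc[1:].eq(s.iloc[:-1]).tolist()

print(positional)  # [True, False, False, True]
print(aligned)     # [False, True, True, True, False]
```

The two results answer different questions, which is why code written for ndarray semantics breaks when `==` starts aligning on labels.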

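This was then changed to the current behavior in pandas 0.19.0. Quoting the "What's new" page:

The following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

Series comparison operators now raise ValueError when the indexes are different.

Series logical operators align the indexes of the left and right hand sides.

This made Series behavior match that of DataFrame, which already rejected mismatched indexes in comparisons.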
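The two rules quoted above can be checked side by side. This is a sketch assuming pandas 0.19 or later; `p` and `q` are made-up boolean Series, while `x` and `y` are the ones from the question.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list("abc"))
y = pd.Series([2, 3, 3], index=list("bca"))

# Comparison operator: raises when the indexes are not identical.
try:
    x == y
    comparison_error = None
except ValueError as exc:
    comparison_error = str(exc)

# Logical operator: aligns on the labels before combining.
p = pd.Series([True, False, True], index=list("abc"))
q = pd.Series([True, True, False], index=list("bca"))
both = p & q

print(comparison_error)
print(both)
```

Here `both` is computed label by label (for example, `p["a"] & q["a"]`), not position by position.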

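In summary, making the comparison operators align indexes automatically turned out to break too much, so raising an error was the best alternative.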

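Great answer. There should be an investigator badge, designed for answers like this one, where the author has clearly taken the time to research, read the code, and dig through Git to find a logical explanation. +1 (7 upvotes)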

Answer 2

Qua*_*ang 8

One thing I love about Python is that you can dig into the source code of almost anything. Looking at the source of pd.Series.eq, it calls:

def flex_wrapper(self, other, level=None, fill_value=None, axis=0):
    # other stuff
    # ...

    if isinstance(other, ABCSeries):
        return self._binop(other, op, level=level, fill_value=fill_value)

which then goes on to pd.Series._binop:

def _binop(self, other, func, level=None, fill_value=None):

    # other stuff
    # ...
    if not self.index.equals(other.index):
        this, other = self.align(other, level=level, join='outer',
                                 copy=False)
        new_index = this.index

This means that eq aligns the two Series before comparing them (which, apparently, the plain == operator does not).
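The same two steps can be reproduced by hand with public API calls. This is only a rough sketch of the idea, not the actual pandas implementation; the names `x_aligned`, `y_aligned`, `manual`, and `flexible` are chosen here for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list("abc"))
y = pd.Series([2, 3, 3], index=list("bca"))

# Step 1: align both Series on the union of their labels (outer join).
x_aligned, y_aligned = x.align(y, join="outer")

# Step 2: the operands are now identically labeled, so == is allowed.
manual = x_aligned == y_aligned

# The flexible method performs the alignment internally.
flexible = x.eq(y)

print(manual.equals(flexible))
```

If the manual route and `x.eq(y)` agree, the alignment step is the only difference between the method and the operator for this input.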

Answer 3

WeN*_*Ben 5

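Back in 2012, before eq, ne, gt and the like existed, pandas had a problem: Series whose indexes were in a different order returned unexpected output from the logical comparisons (>, <, ==, !=). The fix was to add the new methods (gt, ge, ne, ...).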

GitHub ticket for reference
