Preserving the custom attributes and methods of a pandas Series subclass when assigning it to a DataFrame column

Tyl*_*ker 8 python pandas

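I want to create a function that performs ETL on a list/Series and returns a Series with additional attributes and methods specific to that function. I can achieve this by creating a class that extends Series, and it works, but when I assign the output of the function back to a DataFrame column, the added attributes and methods are stripped. How can I extend Series so that it has custom attributes and methods that are not stripped when it is assigned back to a DataFrame?

Custom function that performs the ETL and returns a Series of the extended class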

import pandas as pd

def normalize_x(x: list, new_attribute: None):
 
    normalized = pd.Series(['normalized_'+ i if i != 4 else None for i in x])
    
    return NormalizeX(normalized = normalized, original = x, new_attribute = new_attribute)


class NormalizeX(pd.Series):

    def __init__(self, normalized, original, new_attribute, *args, **kwargs,):
        super().__init__(data = normalized, *args, **kwargs)

        self.original = original
        self.normalized = normalized
        self.new_attribute = new_attribute


    def conversion_errors(self):

        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})

Assigning to a new object (the new attributes and methods work)

out = normalize_x(df.C, new_attribute = 'CoolAttribute')

out
## 0    normalized_dog
## 1    normalized_cat
## 2              None
## dtype: object

## Can still use Series methods
out.to_list()
## ['normalized_dog', 'normalized_cat', None]

## Can use the new methods and access attributes
out.conversion_errors()
## [False, False, True]
out.original
##0    dog
##1    cat
##2      4
##Name: C, dtype: object

Assigning to a pandas DataFrame column (the new attributes and methods break)

df['new'] = normalize_x(df.C, new_attribute = 'CoolAttribute')

df['new']
## 0    normalized_dog
## 1    normalized_cat
## 2              None
## dtype: object

## Can't use the new methods or access attributes
df['new'].conversion_errors()
## AttributeError: 'Series' object has no attribute 'conversion_errors'
df['new'].original
## AttributeError: 'Series' object has no attribute 'original'

Sbu*_*ini 8

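Pandas lets you extend its classes (Series, DataFrame). In your case the solution is quite verbose, but I think it is the only way to achieve your goal.

I have tried to get straight to the point without analyzing the complex cases, so a complete implementation of the interface is up to you, but I can give you an idea of what you can use.

I don't understand the usefulness of new_attribute, so I am leaving it out for now. Basically, as far as I know, pandas extensions only let you extend a one-dimensional array. Since you have more than one array (both normalized and original), you have to create another data type to work around that.

class NormX(object):
    def __init__(self, normalized, original):
        self.normalized = normalized
        self.original = original

    def __repr__(self):
        if self.normalized is None:
            return 'Nan'
        return self.normalized

This lets you create a simple base object, like this:

norm_obj = NormX('normalized_dog', 'dog')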

This object will be the basic building block of your custom array. To be able to make use of this kind of class, you have to register a new type in pandas:

from pandas.api.extensions import ExtensionDtype, register_extension_dtype
import numpy as np

@register_extension_dtype
class NormXType(ExtensionDtype):
    name = 'normX'
    type = NormX
    kind = 'O'
    na_value = np.nan

Now you have all the elements needed to build a custom array on top of the pandas framework. To do that, you have to extend its interface class, called ExtensionArray; the pandas documentation for that class lists the abstract methods a subclass must implement. I give you a very basic implementation, but it should be declared in the correct way:

import sys

from pandas.api.extensions import ExtensionArray

class NormalizeX(ExtensionArray):
    def __init__(self, values):
        self.data = values

    def __repr__(self):
        return "NormalizeX({!r})".format([(t.normalized, t.original) for t in self.data])

    def _from_sequence(self):
        pass

    def _from_factorized(self):
        pass

    def __getitem__(self, key):
        return self.data[key]

    # def __setitem__(self, key, value):
    #     self.normalized[key] = value
    #     return self

    def __len__(self):
        return len(self.data)

    def __eq__(self, other):
        return False

    def dtype(self):
        # return self._dtype
        return object

    def nbytes(self):
        return sys.getsizeof(self.data)

    def isna(self):
        return False

    def take(self):
        pass

    def copy(self):
        return type(self)(self.data)

    def _concat_same_type(self):
        pass

In addition, to define custom methods on that class, you have to define a custom Series accessor.

In this way, the NormalizeX custom array implements all the methods required for it to integrate successfully into Series and DataFrame, which simplifies your example.
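As a rough end-to-end illustration of the approach described in this answer, here is a minimal sketch that combines an element type, a registered dtype, an ExtensionArray, and a Series accessor. It is not the answer author's code: the names NormXDtype, NormXArray, NormXAccessor, and the accessor name `normx` are hypothetical, the array stores its elements in a NumPy object array, and anything that is not a NormX instance is treated as missing. Only the methods needed for column assignment and the accessor are filled in.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray, ExtensionDtype, register_extension_dtype,
    register_series_accessor, take,
)
from pandas.api.indexers import check_array_indexer


class NormX:
    """One element: the normalized value plus the original it came from."""
    def __init__(self, normalized, original):
        self.normalized = normalized
        self.original = original

    def __repr__(self):
        return "NaN" if self.normalized is None else str(self.normalized)


@register_extension_dtype
class NormXDtype(ExtensionDtype):  # hypothetical name
    name = "normX"
    type = NormX
    kind = "O"
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return NormXArray


class NormXArray(ExtensionArray):  # hypothetical name
    def __init__(self, values):
        # Keep elements in a 1-D object ndarray so NumPy indexing can be reused.
        self._data = np.empty(len(values), dtype=object)
        for i, v in enumerate(values):
            self._data[i] = v

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, key):
        key = check_array_indexer(self, key)
        result = self._data[key]
        # Scalar keys return one element; slices and masks return a new array.
        return type(self)(result) if isinstance(result, np.ndarray) else result

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return NormXDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        # Assumption: any slot that does not hold a NormX counts as missing.
        return np.array([not isinstance(x, NormX) for x in self._data], dtype=bool)

    def take(self, indices, *, allow_fill=False, fill_value=None):
        data = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(data)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._data for a in to_concat]))


@register_series_accessor("normx")  # hypothetical accessor name
class NormXAccessor:
    def __init__(self, series):
        if not isinstance(series.dtype, NormXDtype):
            raise AttributeError("`.normx` is only available on normX columns")
        self._series = series

    @property
    def original(self):
        values = [x.original for x in self._series.array]
        return pd.Series(values, index=self._series.index)

    def conversion_errors(self):
        # True where the null-ness of original and normalized values differs.
        return [pd.isnull(x.original) != pd.isnull(x.normalized)
                for x in self._series.array]


def normalize_x(x):
    # Non-string inputs are treated as failed conversions (an assumption).
    cells = [NormX(f"normalized_{v}" if isinstance(v, str) else None, v) for v in x]
    return NormXArray(cells)


df = pd.DataFrame({"A": [1, 2, 3], "C": ["dog", "cat", 4]})
df["new"] = normalize_x(df["C"])
errors = df["new"].normx.conversion_errors()
originals = df["new"].normx.original.tolist()
```

Because the extra data lives inside each element of the array rather than on the Series object, it is carried along when the column's values are copied into the DataFrame. The custom methods are then reached through `df["new"].normx` instead of directly on the Series.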


qua*_*man 3

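Implementing what you want was too difficult for me, so I will just share what I found while investigating, in the hope that it is useful to other answerers.

Cause of the problem:

The reason you get these attribute errors is that the Series you pass to the DataFrame is copied.

A brief check of the ids

You can quickly confirm that out and df['new'] refer to different objects with the following code: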

out = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new'] = out
print(id(out))
print(id(df['new']))
1861777917792
1861770685504

You can see that out and df['new'] are different objects, because their ids differ.
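A complementary check is to compare the types rather than the ids, since every call to df['new'] may build a fresh object anyway. The sketch below uses a small hypothetical subclass, MySeries, as a stand-in for NormalizeX, and shows that the column read back from the DataFrame is a plain Series.

```python
import pandas as pd


class MySeries(pd.Series):
    """Hypothetical stand-in for the NormalizeX subclass."""

    def shout(self):
        return [str(v).upper() for v in self]


df = pd.DataFrame({"C": ["dog", "cat"]})
out = MySeries(["x", "y"])
df["new"] = out
col = df["new"]

print(type(out).__name__)     # the subclass
print(type(col).__name__)     # the class of the column read back
print(hasattr(col, "shout"))  # whether the custom method survived
```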

Let's dig into the pandas source code to see what is happening here.

The DataFrame._set_item method:

In the definition of the DataFrame class, the _set_item method is what runs when you add a Series to a DataFrame in a specified column.

    def _set_item(self, key, value) -> None:
        """
        Add series to DataFrame in specified column.
        If series is a numpy-array (not a Series/TimeSeries), it must be the
        same length as the DataFrames index or an error will be thrown.
        Series/TimeSeries will be conformed to the DataFrames index to
        ensure homogeneity.
        """
        value = self._sanitize_column(value)

In this method, value = self._sanitize_column(value) is the first line after the docstring. This _sanitize_column method is what actually strips the functionality of your original Series. If you dig deeper into that method, you eventually reach a call to value._values.copy().

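value._values.copy() is the direct reason the NormalizeX attributes disappear: it copies only the values of the given Series. So the _set_item method would have to be modified to preserve the NormalizeX attributes.

Conclusion:

You have to override the DataFrame class so that NormalizeX can be set in the specified column while keeping its attributes.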
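One way to start on such an override, without patching _set_item itself, is the subclassing hook pandas documents for this purpose: a DataFrame subclass whose _constructor_sliced returns a Series subclass. The following is only a sketch under that assumption, with hypothetical names NormFrame and NormSeries. It makes custom methods available on every column that is read back, but per-instance attributes set on an assigned Series (such as original) are still not expected to survive, because the column's values are copied into the frame.

```python
import pandas as pd


class NormSeries(pd.Series):  # hypothetical name
    @property
    def _constructor(self):
        return NormSeries

    @property
    def _constructor_expanddim(self):
        return NormFrame

    def null_flags(self):
        # Custom method that only needs the values stored in the column.
        return [bool(pd.isnull(v)) for v in self]


class NormFrame(pd.DataFrame):  # hypothetical name
    @property
    def _constructor(self):
        return NormFrame

    @property
    def _constructor_sliced(self):
        # Columns read back from this frame are built with this class.
        return NormSeries


df = NormFrame({"C": ["dog", "cat", 4]})
df["new"] = ["normalized_dog", "normalized_cat", None]
col = df["new"]
flags = col.null_flags()
```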