san*_*tle 11 python immutability pandas
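I am interested in an immutable DataFrame to use as a reference table in a program, with a read_only property enforced after it has been initially constructed (in my case, in a class's def __init__() method).
I see that Index objects are frozen.
Is there a way to make an entire DataFrame immutable?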
fle*_*one 16
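The StaticFrame package (of which I am an author) implements a Pandas-like interface and many common Pandas operations, while enforcing immutability in the underlying NumPy arrays and in immutable Series and Frame containers.
You can convert a Pandas DataFrame to a StaticFrame Frame with static_frame.Frame.from_pandas(df). You can then use it as a truly read-only table.
See the StaticFrame documentation for this method: https://static-frame.readthedocs.io/en/latest/api_detail/frame.html#frame-constructor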
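As a rough illustration of what array-level immutability means, here is a minimal NumPy sketch. It is not StaticFrame's own code, and the variable names are made up for the example; it only shows the standard writeable flag that NumPy exposes on arrays.

```python
import numpy as np

values = np.array([1, 2, 3])
values.flags.writeable = False  # mark the array's buffer as read-only

try:
    values[0] = 99              # NumPy refuses writes to a read-only array
    write_blocked = False
except ValueError:
    write_blocked = True

print(write_blocked)
print(values.tolist())
```

A container that only ever hands out arrays in this state cannot be changed through ordinary element assignment.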
Try code like this:
import pandas as pd

class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame(index=[1, 2, 3])

    @property
    def df(self):
        # Hand out a copy so callers can never touch the original.
        return self._df.copy()
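This lets you get the df back using b.df (where b = Bla()), but you will not be able to assign to it. In short, the df in your class behaves like an "immutable DataFrame" purely in that it blocks changes to the original. The returned object is still a mutable DataFrame, however, so it will not behave like an immutable one in other respects; for example, you will not be able to use it as a key in a dictionary.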
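A short sketch of that behaviour, assuming a hypothetical ReferenceTable class with an illustrative "value" column (neither name comes from the answer):

```python
import pandas as pd


class ReferenceTable:
    """Hypothetical holder class; the name and column are illustrative."""

    def __init__(self):
        self._df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])

    @property
    def df(self):
        # Every access returns a fresh copy of the private frame.
        return self._df.copy()


table = ReferenceTable()

snapshot = table.df
snapshot.loc[1, "value"] = 999      # changes only the copy

try:
    table.df = pd.DataFrame()       # the property has no setter
    rebinding_blocked = False
except AttributeError:
    rebinding_blocked = True

try:
    hash(table.df)                  # DataFrames are unhashable
    hashable = True
except TypeError:
    hashable = False
```

The price of this approach is one full copy per access, which may matter for large tables.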
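If you really want to make the DataFrame behave as immutable instead of using the copy solution by @Joop (which I would recommend), you could build on the following structure.
Note that it is just a starting point.
It is basically a proxy data object that hides everything that would change state and allows itself to be hashed; all instances wrapping the same original data will have the same hash. There are probably modules that do the following in cooler ways, but I figured it could be educational as an example.
Some warnings:
- Depending on how the string representation of the proxied object is constructed, two different proxied objects could get the same hash. The implementation is, however, compatible with DataFrames among other objects.
- Changes to the original object will affect the proxy object.
- Equality will lead to some nasty infinite recursion if the other object tosses the equality question back (this is why list has a special case).
- The DataFrame proxy helper is just a start. The problem is that any method that changes the state of the original object must either not be allowed, be manually overridden by the helper, or be masked entirely through the extraFilter parameter when instantiating _ReadOnly. See DataFrameProxy.sort.
- The proxy will not show up as derived from the proxied type.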
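The first two warnings can be seen in isolation with a small sketch. It assumes a hypothetical repr_hash helper that, like the proxy, derives a hash from the MD5 digest of str(obj); the helper name and sample values are illustrative only.

```python
import hashlib


def repr_hash(obj):
    """Hash an object by the MD5 digest of its string form (hypothetical helper)."""
    return int(hashlib.md5(str(obj).encode("utf-8")).hexdigest(), 16)


as_list = [1, 2, 3]
as_text = "[1, 2, 3]"

# Two different objects with identical string forms collide.
same_hash = repr_hash(as_list) == repr_hash(as_text)
same_value = as_list == as_text

# A hash computed once goes stale when the original object changes.
hash_before = repr_hash(as_list)
as_list.append(4)
hash_after = repr_hash(as_list)
```

Because the proxy computes its hash once at construction, a later change to the wrapped object leaves the stored hash out of step with the data.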
The generic read-only proxy
This can be used on any object.
import hashlib  # replaces the Python 2 md5 module, which no longer exists in Python 3
import warnings

class _ReadOnly(object):
    def __init__(self, obj, extraFilter=tuple()):
        # Write through __dict__ because __setattr__ is blocked below.
        self.__dict__['_obj'] = obj
        self.__dict__['_d'] = None
        self.__dict__['_extraFilter'] = extraFilter
        # The hash is derived from the string representation of the wrapped object.
        self.__dict__['_hash'] = int(
            hashlib.md5(str(obj).encode('utf-8')).hexdigest(), 16)

    @staticmethod
    def _cloak(obj):
        # Hashable values are returned as-is; unhashable ones get wrapped.
        try:
            hash(obj)
            return obj
        except TypeError:
            return _ReadOnly(obj)

    def __getitem__(self, value):
        return _ReadOnly._cloak(self._obj[value])

    def __setitem__(self, key, value):
        raise TypeError(
            "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

    def __delitem__(self, key):
        raise TypeError(
            "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

    def __getattr__(self, value):
        if value in self.__dir__():
            return _ReadOnly._cloak(getattr(self._obj, value))
        elif value in dir(self._obj):
            raise AttributeError("{0} attribute {1} is cloaked".format(
                type(self._obj), value))
        else:
            raise AttributeError("{0} has no {1}".format(
                type(self._obj), value))

    def __setattr__(self, key, value):
        raise TypeError(
            "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

    def __delattr__(self, key):
        raise TypeError(
            "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

    def __dir__(self):
        # Expose everything except setters and explicitly filtered names.
        if self._d is None:
            self.__dict__['_d'] = [
                i for i in dir(self._obj) if not i.startswith('set')
                and i not in self._extraFilter]
        return self._d

    def __repr__(self):
        return self._obj.__repr__()

    def __call__(self, *args, **kwargs):
        if hasattr(self._obj, "__call__"):
            return self._obj(*args, **kwargs)
        else:
            raise TypeError("{0} not callable".format(type(self._obj)))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        try:
            return hash(self) == hash(other)
        except TypeError:
            if isinstance(other, list):
                try:
                    # Compare element by element; all(zip(...)) alone was
                    # true for any non-empty pairing.
                    return all(a == b for a, b in zip(self, other))
                except Exception:
                    return False
            return other == self
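The DataFrame proxy
This should really be extended with more methods like sort, and with filtering of all other state-changing methods that are not of interest.
You can either instantiate it with a DataFrame instance as the only argument, or give it the arguments you would use to create a DataFrame.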
import pandas as pd

class DataFrameProxy(_ReadOnly):
    EXTRA_FILTER = ('drop', 'drop_duplicates', 'dropna')

    def __init__(self, *args, **kwargs):
        # Check the single argument itself, not the args tuple.
        if (len(args) == 1 and
                not len(kwargs) and
                isinstance(args[0], pd.DataFrame)):
            super(DataFrameProxy, self).__init__(args[0],
                                                 DataFrameProxy.EXTRA_FILTER)
        else:
            super(DataFrameProxy, self).__init__(pd.DataFrame(*args, **kwargs),
                                                 DataFrameProxy.EXTRA_FILTER)

    def sort(self, inplace=False, *args, **kwargs):
        if inplace:
            warnings.warn("Inplace sorting overridden")
        # DataFrame.sort was removed in pandas 0.20; on newer versions
        # delegate to sort_values or sort_index instead.
        return self._obj.sort(*args, **kwargs)
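The filter tuple is maintained by hand, so it is easy to miss a state-changing method. One possible way to find candidates is to inspect method signatures for an inplace parameter. The sketch below assumes a hypothetical helper named methods_accepting_inplace; its result depends on the installed pandas version, and it does not catch methods that mutate without an inplace flag (such as insert or update).

```python
import inspect

import pandas as pd


def methods_accepting_inplace(cls):
    """Return public callables on cls whose signature has an 'inplace' parameter."""
    names = []
    for name in dir(cls):
        if name.startswith("_"):
            continue
        try:
            member = getattr(cls, name)
        except Exception:
            continue  # some attributes may not be readable on the class
        if not callable(member):
            continue
        try:
            params = inspect.signature(member).parameters
        except (TypeError, ValueError):
            continue  # no introspectable signature
        if "inplace" in params:
            names.append(name)
    return names


candidates = methods_accepting_inplace(pd.DataFrame)
print(candidates)
```

The resulting names could be passed as the extraFilter argument, or overridden one by one in the same way as sort.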
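Finally:
Fun as it was to build this contraption, why not simply have a DataFrame that you do not alter? If it is only exposed to you, it is better to just make sure you do not alter it...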
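By studying the pandas implementation and making use of pandas' capabilities, one can patch a DataFrame object to achieve this behavior. I implemented a function named make_dataframe_immutable(dataframe) to solve this problem. It was written for pandas==0.25.3.
EDIT: added a solution for pandas==1.0.5 and pandas==1.1.4.
Newer pandas versions may require adjustments; hopefully that will not be too hard to do with the tests below.
This solution is new and has not been thoroughly tested; any feedback is appreciated.
It would be nice if someone could post an inverse make_dataframe_mutable() function here.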
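The core trick in the code that follows is to shadow a method on one specific instance with a function that raises. A minimal, pandas-free sketch of that idea, using a made-up Register class and ReadOnlyError exception purely for illustration:

```python
import functools


class ReadOnlyError(Exception):
    """Hypothetical exception type for this sketch."""


def _refuse(*args, **kwargs):
    raise ReadOnlyError("object is frozen")


class Register:
    """Toy stand-in for a DataFrame; not a pandas class."""

    def __init__(self):
        self.items = []

    def add(self, item):
        """Append an item."""
        self.items.append(item)


frozen = Register()
frozen.add("a")
# Store a raising function under the method's name on this instance only.
# functools.wraps copies the original name and docstring onto it.
frozen.add = functools.wraps(frozen.add)(_refuse)

other = Register()
other.add("b")          # other instances keep the normal method

try:
    frozen.add("c")
    blocked = False
except ReadOnlyError:
    blocked = True
```

Because the override lives on the instance, only objects that were explicitly frozen are affected. Note that functools.wraps rewrites the metadata of the shared raising function each time it is applied.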
import functools

import numpy as np
import pandas as pd
from pandas.core.indexing import _NDFrameIndexer

def make_dataframe_immutable(df: pd.DataFrame):
    """
    Makes the given DataFrame immutable.
    I.e. after calling this method - one cannot modify the dataframe using pandas interface.
    Upon a trial to modify an immutable dataframe, an exception of type ImmutablePandas is raised.
    """
    if getattr(df, "_is_immutable", False):
        return
    df._is_immutable = True
    df._set_value = functools.wraps(df._set_value)(_raise_immutable_exception)
    df._setitem_slice = functools.wraps(df._setitem_slice)(_raise_immutable_exception)
    df._setitem_frame = functools.wraps(df._setitem_frame)(_raise_immutable_exception)
    df._setitem_array = functools.wraps(df._setitem_array)(_raise_immutable_exception)
    df._set_item = functools.wraps(df._set_item)(_raise_immutable_exception)
    df._data.delete = functools.wraps(df._data.delete)(_raise_immutable_exception)
    df.update = functools.wraps(df.update)(_raise_immutable_exception)
    df.insert = functools.wraps(df.insert)(_raise_immutable_exception)
    df._get_item_cache = _make_result_immutable(df._get_item_cache)
    # prevent modification through numpy arrays
    df._data.as_array = _make_numpy_result_readonly(df._data.as_array)
    _prevent_inplace_argument_in_function_calls(
        df,
        # This list was obtained by manual inspection +
        # [attr for attr in dir(d) if hasattr(getattr(pd.DataFrame, attr, None), '__code__') and
        # 'inplace' in getattr(pd.DataFrame, attr).__code__.co_varnames]
        (
            'bfill',
            'clip',
            'clip_lower',
            'clip_upper',
            'drop',
            'drop_duplicates',
            'dropna',
            'eval',
            'ffill',
            'fillna',
            'interpolate',
            'mask',
            'query',
            'replace',
            'reset_index',
            'set_axis',
            'set_index',
            'sort_index',
            'sort_values',
            'where',
            "astype",
            "assign",
            "reindex",
            "rename",
        ),
    )

def make_series_immutable(series: pd.Series):
    """
    Makes the given Series immutable.
    I.e. after calling this method - one cannot modify the series using pandas interface.
    Upon a trial to modify an immutable dataframe, an exception of type ImmutablePandas is raised.
    """
    if getattr(series, "_is_immutable", False):
        return
    series._is_immutable = True
    series._set_with_engine = functools.wraps(series._set_with_engine)(_raise_immutable_exception)
    series._set_with = functools.wraps(series._set_with)(_raise_immutable_exception)
    series.set_value = functools.wraps(series.set_value)(_raise_immutable_exception)
    # prevent modification through numpy arrays
    series._data.external_values = _make_numpy_result_readonly(series._data.external_values)
    series._data.internal_values = _make_numpy_result_readonly(series._data.internal_values)
    series._data.get_values = _make_numpy_result_readonly(series._data.get_values)
    _prevent_inplace_argument_in_function_calls(
        series,
        # This list was obtained by manual inspection +
        # [attr for attr in dir(d) if hasattr(getattr(pd.Series, attr, None), '__code__') and
        # 'inplace' in getattr(pd.Series, attr).__code__.co_varnames]
        (
            "astype",
            'bfill',
            'clip',
            'clip_lower',
            'clip_upper',
            'drop',
            'drop_duplicates',
            'dropna',
            'ffill',
            'fillna',
            'interpolate',
            'mask',
            'replace',
            'reset_index',
            'set_axis',
            'sort_index',
            'sort_values',
            "valid",
            'where',
            "_set_name",
        ),
    )

class ImmutablePandas(Exception):
    pass

def _raise_immutable_exception(*args, **kwargs):
    raise ImmutablePandas("Cannot modify immutable dataframe. Please use df.copy()")

def _get_df_or_series_from_args(args):
    # Returns None implicitly when no DataFrame or Series is found.
    if len(args) >= 2 and (isinstance(args[1], pd.DataFrame) or isinstance(args[1], pd.Series)):
        return args[1]

def _safe__init__(self, *args, **kwargs):
    super(_NDFrameIndexer, self).__init__(*args, **kwargs)
    df_or_series = _get_df_or_series_from_args(args)
    if df_or_series is not None:
        if getattr(df_or_series, "_is_immutable", False):
            self._get_setitem_indexer = functools.wraps(self._get_setitem_indexer)(_raise_immutable_exception)

# This line is the greatest foul in this module - as it performs a global patch.
# Notice that a reload of this module incurs overriding this variable again and again. It is supported.
_NDFrameIndexer.__init__ = functools.wraps(_NDFrameIndexer.__init__)(_safe__init__)

def _make_numpy_result_readonly(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        if isinstance(res, np.ndarray):
            res.flags.writeable = False
        return res
    return wrapper

def _make_result_immutable(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        if isinstance(res, pd.Series):
            make_series_immutable(res)
        return res
    return wrapper

def _prevent_inplace_operation(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # TODO: here we assume that in-place is not given as a positional.
        # remove this assumption, either by hard-coding the position for each method or by parsing the
        # function signature.
        if kwargs.get("inplace", False):
            _raise_immutable_exception()
        return func(*args, **kwargs)
    return wrapper

def _prevent_inplace_argument_in_function_calls(obj, attributes):
    for attr in attributes:
        member = getattr(obj, attr)
        setattr(obj, attr, _prevent_inplace_operation(member))
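The TODO in _prevent_inplace_operation mentions parsing the function signature so that a positional inplace argument is also caught. One possible sketch of that idea uses inspect.signature and bind_partial. The names forbid_inplace and FrozenError, and the toy sort_values function, are illustrative and not part of the module above.

```python
import functools
import inspect


class FrozenError(Exception):
    """Hypothetical exception type for this sketch."""


def forbid_inplace(func):
    """Wrap func so that a truthy 'inplace' argument raises, however it is passed."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind_partial(*args, **kwargs)
        # The kwargs check covers functions that only accept inplace via **kwargs.
        if bound.arguments.get("inplace", False) or kwargs.get("inplace", False):
            raise FrozenError("in-place operation refused")
        return func(*args, **kwargs)

    return wrapper


def sort_values(by, ascending=True, inplace=False):
    """Toy function standing in for a pandas method."""
    return ("sorted", by, ascending, inplace)


guarded = forbid_inplace(sort_values)

result = guarded("x")

try:
    guarded("x", True, True)    # inplace passed positionally
    positional_blocked = False
except FrozenError:
    positional_blocked = True

try:
    guarded("x", inplace=True)  # inplace passed by keyword
    keyword_blocked = False
except FrozenError:
    keyword_blocked = True
```

Whether this works for every wrapped pandas method depends on each method exposing an introspectable signature, so it would need checking against the pandas version in use.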
pytest unit tests
import importlib
import warnings

import pandas as pd
import pytest

import immutable_pandas
# The tests use these names unqualified, so import them explicitly.
from immutable_pandas import ImmutablePandas
from immutable_pandas import make_dataframe_immutable
from immutable_pandas import make_series_immutable

def create_immutable_dataframe() -> pd.DataFrame:
    # Cannot be used as a fixture because pytest copies objects transparently, which makes the tests flaky
    immutable_dataframe = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 5, 6, 7]})
    make_dataframe_immutable(immutable_dataframe)
    return immutable_dataframe

def test_immutable_dataframe_cannot_change_with_direct_access():
    immutable_dataframe = create_immutable_dataframe()
    immutable_dataframe2 = immutable_dataframe.query("x == 2")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        immutable_dataframe2["moshe"] = 123
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.x = 2
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["moshe"] = 56
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.insert(0, "z", [1, 2, 3, 4])

def test_immutable_dataframe_cannot_change_with_inplace_operations():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.eval("y=x+1", inplace=True)
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.assign(y=2, inplace=True)

def test_immutable_dataframe_cannot_change_with_loc():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.loc[2] = 1
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.iloc[1] = 4

def test_immutable_dataframe_cannot_change_with_columns_access():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["x"][2] = 123
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["x"].loc[2] = 123

def test_immutable_dataframe_cannot_del_column():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        del immutable_dataframe["x"]

def test_immutable_dataframe_cannot_be_modified_through_values():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ValueError, match="read-only"):
        immutable_dataframe.values[0, 0] = 1
    with pytest.raises(ValueError, match="read-only"):
        immutable_dataframe.as_matrix()[0, 0] = 1

def test_immutable_series_cannot_change_with_loc():
    series = pd.Series([1, 2, 3, 4])
    make_series_immutable(series)
    with pytest.raises(ImmutablePandas):
        series.loc[0] = 1
    with pytest.raises(ImmutablePandas):
        series.iloc[0] = 1

def test_immutable_series_cannot_change_with_inplace_operations():
    series = pd.Series([1, 2, 3, 4])
    make_series_immutable(series)
    with pytest.raises(ImmutablePandas):
        series.sort_index(inplace=True)
    with pytest.raises(ImmutablePandas):
        series.sort_values(inplace=True)
    with pytest.raises(ImmutablePandas):
        series.astype(int, inplace=True)

def test_series_cannot_be_modeified_through_values():
    series = pd.Series([1, 2, 3, 4])
    make_series_immutable(series)
    with pytest.raises(ValueError, match="read-only"):
        series.get_values()[0] = 1234
    series = pd.Series([1, 2, 3, 4])
    make_series_immutable(series)
    with pytest.raises(ValueError, match="read-only"):
        series.values[0] = 1234

def test_reloading_module_immutable_pandas_does_not_break_immutability():
    # We need to test the effects of reloading the module, because we modify the global variable
    # _NDFrameIndexer.__init__ upon every reload of the module.
    df = create_immutable_dataframe()
    df2 = df.copy()
    immutable_pandas2 = importlib.reload(immutable_pandas)
    with pytest.raises(immutable_pandas2.ImmutablePandas):
        df.astype(int, inplace=True)
    df2.astype(int, inplace=True)
    immutable_pandas2.make_dataframe_immutable(df2)
    with pytest.raises(immutable_pandas2.ImmutablePandas):
        df2.astype(int, inplace=True)
EDIT: here is an update that was tested on pandas==1.0.5 and pandas==1.1.4
"""
Two methods to make pandas objects immutable.
make_dataframe_immutable()
make_series_immutable()
"""
import functools

import numpy as np
import pandas as pd
from pandas.core.indexing import _iLocIndexer
from pandas.core.indexing import _LocIndexer
from pandas.core.indexing import IndexingMixin

def make_dataframe_immutable(df: pd.DataFrame):
    """
    Makes the given DataFrame immutable.
    I.e. after calling this method - one cannot modify the dataframe using pandas interface.
    Upon a trial to modify an immutable dataframe, an exception of type ImmutablePandas is raised.
    """
    if getattr(df, "_is_immutable", False):
        return
    df._is_immutable = True
    df._set_value = functools.wraps(df._set_value)(_raise_immutable_exception)
    df._setitem_slice = functools.wraps(df._setitem_slice)(_raise_immutable_exception)
    df._setitem_frame = functools.wraps(df._setitem_frame)(_raise_immutable_exception)
    df._setitem_array = functools.wraps(df._setitem_array)(_raise_immutable_exception)
    df._set_item = functools.wraps(df._set_item)(_raise_immutable_exception)
    if hasattr(df, "_mgr"):
        # pandas==1.1.4
        df._mgr.idelete = functools.wraps(df._mgr.idelete)(_raise_immutable_exception)
    elif hasattr(df, "_data"):
        # pandas==1.0.5
        df._data.delete = functools.wraps(df._data.delete)(_raise_immutable_exception)
    df.update = functools.wraps(df.update)(_raise_immutable_exception)
    df.insert = functools.wraps(df.insert)(_raise_immutable_exception)
    df._get_item_cache = _make_result_immutable(df._get_item_cache)
    # prevent modification through numpy arrays
    df._data.as_array = _make_numpy_result_readonly(df._data.as_array)
    _prevent_inplace_argument_in_function_calls(
        df,
        # This list was obtained by manual inspection +
        # [attr for attr in dir(d) if hasattr(getattr(pd.DataFrame, attr, None), '__code__') and
        # 'inplace' in getattr(pd.DataFrame, attr).__code__.co_varnames]
        (
            "bfill",
            "clip",
            "drop",
            "drop_duplicates",
            "dropna",
            "eval",
            "ffill",
            "fillna",
            "interpolate",
            "mask",
            "query",
            "replace",
            "reset_index",
            "set_axis",
            "set_index",
            "sort_index",
            "sort_values",
            "where",
            "astype",
            "assign",
            "reindex",
            "rename",
        ),
    )

def make_series_immutable(series: pd.Series):
    """
    Makes the given Series immutable.
    I.e. after calling this method - one cannot modify the series using pandas interface.
    Upon a trial to modify an immutable dataframe, an exception of type ImmutablePandas is raised.
    """
    if getattr(series, "_is_immutable", False):
        return
    series._is_immutable = True
    series._set_with_engine = functools.wraps(series._set_with_engine)(_raise_immutable_exception)
    series._set_with = functools.wraps(series._set_with)(_raise_immutable_exception)
    # prevent modification through numpy arrays
    series._data.external_values = _make_numpy_result_readonly(series._data.external_values)
    series._data.internal_values = _make_numpy_result_readonly(series._data.internal_values)
    _prevent_inplace_argument_in_function_calls(
        series,
        # This list was obtained by manual inspection +
        # [attr for attr in dir(d) if hasattr(getattr(pd.Series, attr, None), '__code__') and
        # 'inplace' in getattr(pd.Series, attr).__code__.co_varnames]
        (
            "astype",
            "bfill",
            "clip",
            "drop",
            "drop_duplicates",
            "dropna",
            "ffill",
            "fillna",
            "interpolate",
            "mask",
            "replace",
            "reset_index",
            "set_axis",
            "sort_index",
            "sort_values",
            "where",
            "_set_name",
        ),
    )

class ImmutablePandas(Exception):
    pass

def _raise_immutable_exception(*args, **kwargs):
    raise ImmutablePandas("Cannot modify immutable dataframe. Please use df.copy()")

def _get_df_or_series_from_args(args):
    # Returns None implicitly when no DataFrame or Series is found.
    if len(args) >= 2 and (isinstance(args[1], pd.DataFrame) or isinstance(args[1], pd.Series)):
        return args[1]

def _protect_indexer(loc_func):
    # Currently unused pass-through wrapper.
    def wrapper(*args, **kwargs):
        res = loc_func(*args, **kwargs)
        return res
    return wrapper

def _safe__init__(cls, self, *args, **kwargs):
    super(cls, self).__init__(*args, **kwargs)
    df_or_series = _get_df_or_series_from_args(args)
    if df_or_series is not None:
        if getattr(df_or_series, "_is_immutable", False):
            self._get_setitem_indexer = functools.wraps(self._get_setitem_indexer)(_raise_immutable_exception)

@functools.wraps(IndexingMixin.loc)
def _safe_loc(self):
    loc = _LocIndexer("loc", self)
    if getattr(self, "_is_immutable", False):
        # Edit also loc._setitem_with_indexer
        loc._get_setitem_indexer = functools.wraps(loc._get_setitem_indexer)(_raise_immutable_exception)
    return loc

@functools.wraps(IndexingMixin.iloc)
def _safe_iloc(self):
    iloc = _iLocIndexer("iloc", self)
    if getattr(self, "_is_immutable", False):
        # Edit also iloc._setitem_with_indexer
        iloc._get_setitem_indexer = functools.wraps(iloc._get_setitem_indexer)(_raise_immutable_exception)
    return iloc

# wraps
# Global patch: replaces the loc and iloc properties on the pandas classes themselves.
pd.DataFrame.loc = property(_safe_loc)
pd.Series.loc = property(_safe_loc)
pd.DataFrame.iloc = property(_safe_iloc)
pd.Series.iloc = property(_safe_iloc)

def _make_numpy_result_readonly(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        if isinstance(res, np.ndarray):
            res.flags.writeable = False
        return res
    return wrapper

def _make_result_immutable(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        if isinstance(res, pd.Series):
            make_series_immutable(res)
        return res
    return wrapper

def _prevent_inplace_operation(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # TODO: here we assume that in-place is not given as a positional.
        # remove this assumption, either by hard-coding the position for each method or by parsing the
        # function signature.
        if kwargs.get("inplace", False):
            _raise_immutable_exception()
        return func(*args, **kwargs)
    return wrapper

def _prevent_inplace_argument_in_function_calls(obj, attributes):
    for attr in attributes:
        member = getattr(obj, attr)
        setattr(obj, attr, _prevent_inplace_operation(member))
And the pytest file
import importlib
import warnings

import pandas as pd
import pytest

import immutable_pandas
from immutable_pandas import ImmutablePandas
from immutable_pandas import make_dataframe_immutable
from immutable_pandas import make_series_immutable

def create_immutable_dataframe() -> pd.DataFrame:
    # Cannot be used as a fixture because pytest copies objects transparently, which makes the tests flaky
    immutable_dataframe = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 5, 6, 7]})
    make_dataframe_immutable(immutable_dataframe)
    return immutable_dataframe

def test_immutable_dataframe_cannot_change_with_direct_access():
    immutable_dataframe = create_immutable_dataframe()
    immutable_dataframe2 = immutable_dataframe.query("x == 2")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        immutable_dataframe2["moshe"] = 123
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.x = 2
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["moshe"] = 56
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.insert(0, "z", [1, 2, 3, 4])

def test_immutable_dataframe_cannot_change_with_inplace_operations():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.eval("y=x+1", inplace=True)
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.assign(y=2, inplace=True)

def test_immutable_dataframe_cannot_change_with_loc():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.loc[2] = 1
    with pytest.raises(ImmutablePandas):
        immutable_dataframe.iloc[1] = 4

def test_immutable_dataframe_cannot_change_with_columns_access():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["x"][2] = 123
    with pytest.raises(ImmutablePandas):
        immutable_dataframe["x"].loc[2] = 123

def test_immutable_dataframe_cannot_del_column():
    immutable_dataframe = create_immutable_dataframe()
    with pytest.raises(ImmutablePandas):
        del immutable_dataframe["x"]

def test_immutable_dataframe_cannot_be_modified_through_values():
    immutable_dataframe = create_immutable_dataframe()
    wit