Saving additional attributes in a Pandas DataFrame

tnk*_*epp 24 python-2.7 pandas

I remember, from my MATLAB days, using structured arrays where you could store different data as attributes of a main structure. Something like:

a = struct();
a.A = magic(10)
a.B = magic(50); etc.

where a.A and a.B are entirely separate from each other, allowing you to store different types in a and operate on them as needed. Pandas lets us do something similar, but not quite the same.

I am working with Pandas and want to store attributes of a DataFrame without actually putting them inside the DataFrame itself. This can be done via:

import pandas as pd

a = pd.DataFrame(data=pd.np.random.randint(0,100,(10,5)),columns=list('ABCED'))

# now store an attribute of <a>
a.local_tz = 'US/Eastern'

Now the local timezone is stored on a, but when I save the DataFrame I cannot save this attribute with it (i.e. after reloading there is no a.local_tz). Is there a way to save these attributes?

Currently I just create new columns in the DataFrame to hold information like the timezone, latitude, longitude, etc., but that seems wasteful. On top of that, when I analyze the data I run into the problem of having to exclude those extra columns.
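To illustrate that workaround (a rough sketch, not from the original post): the constant metadata gets repeated on every row and then has to be dropped again before any analysis:

import pandas as pd

a = pd.DataFrame(data=pd.np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

# wasteful workaround: repeat constant metadata in extra columns
a['local_tz'] = 'US/Eastern'
a['lat'] = 40.7

# ...and exclude those columns again whenever the actual data is analysed
data_only = a.drop(['local_tz', 'lat'], axis=1)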

################## BEGIN EDIT ##################

Following unutbu's suggestion, I now store the data in HDF5 format. As noted above, loading the metadata back as attributes of the DataFrame itself is risky. However, since I am the one creating these files (and the processing algorithm), I can choose what does and does not get stored as metadata. While processing the data that goes into the H5 files, I keep the metadata in a dictionary that is initialized as an attribute of a class. I made a simple IO class that imports the H5 data and attaches the metadata as class attributes. Now I can work with my DataFrame without losing the metadata.

import pandas as pd
from pytz import utc   # used below to re-localize the 'DATE' index

# NOTE: <unscale> (called in h5load) is a separate helper of mine, not shown here.

class IO():
    def __init__(self):
        self.dtfrmt = 'dummy_str'

    def h5load(self,filename,update=False):
        '''h5load loads the stored HDF5 file.  Both the dataframe (actual data) and 
        the associated metadata are stored in the H5file

        NOTE: This does not load "any" H5 
        file, it loads H5 files specifically created to hold dataframe data and 
        metadata.

        When multi-indexed dataframes are stored in the H5 format the date 
        values (previously initialized with timezone information) lose their
        timezone localization.  Therefore, <h5load> re-localizes the 'DATE' 
        index as UTC.

        Parameters
        ----------
        filename : string/path
            path and filename of H5 file to be loaded.  H5 file must have been 
            created using <h5store> below.

        update : boolean True/False
            default: False
            If the selected dataframe is to be updated then it is imported 
            slightly different.  If update==True, the <metadata> attribute is
            returned as a dictionary and <data> is returned as a dataframe 
            (i.e., as a stand-alone dictionary with no attributes, and NOT an 
            instance of the IO() class).  Otherwise, if False, <metadata> is 
            returned as an attribute of the class instance.

        Output
        ------
        data : Pandas dataframe with attributes
            The dataframe contains only the data as collected by the instrument.  
            Any metadata (e.g. timezone, scaling factor, basically anything that
            is constant throughout the file) is stored as an attribute (e.g. lat 
            is stored as <data.lat>).'''

        with pd.HDFStore(filename,'r') as store:
            self.data = store['mydata']
            self.metadata = store.get_storer('mydata').attrs.metadata    # metadata gets stored as attributes, so no need to make <metadata> an attribute of <self>

            # put metadata into <data> dataframe as attributes
            for r in self.metadata:
                setattr(self,r,self.metadata[r])

        # unscale data
        self.data, self.metadata = unscale(self.data,self.metadata,stringcols=['routine','date'])

        # when pandas stores multi-index dataframes as H5 files the timezone
        # initialization is lost.  Remake index with timezone initialized: only
        # for multi-indexed dataframes
        if isinstance(self.data.index, pd.MultiIndex):
            # list index-level names, and identify 'DATE' level
            namen = self.data.index.names
            date_lev = namen.index('DATE')

            # extract index as list and remake tuples with timezone initialized
            new_index = self.data.index.tolist()
            for r in xrange( len(new_index) ):
                tmp = list( new_index[r] )
                tmp[date_lev] = utc.localize( tmp[date_lev] )

                new_index[r] = tuple(tmp)

            # reset multi-index
            self.data.index = pd.MultiIndex.from_tuples( new_index, names=namen )


        if update:
            return self.metadata, self.data
        else:
            return self





    def h5store(self,data, filename, **kwargs):
        '''h5store stores the dataframe as an HDF5 file.  Both the dataframe 
        (actual data) and the associated metadata are stored in the H5file

        Parameters
        ----------
        data : Pandas dataframe NOT a class instance
            Must be a dataframe, not a class instance (i.e. cannot be an instance 
            named <data> that has an attribute named <data> (e.g. the Pandas 
            data frame is stored in data.data)).  If the dataframe is under
            data.data then the input variable must be data.data.

        filename : string/path
            path and filename of the H5 file to be written (created or 
            overwritten by this method).

        **kwargs : dictionary
            dictionary containing metadata information.


        Output
        ------
        None: only saves data to file'''

        with pd.HDFStore(filename,'w') as store:
            store.put('mydata',data)
            store.get_storer('mydata').attrs.metadata = kwargs

The H5 file is then loaded via data = IO().h5load('filename.h5'). The DataFrame is stored under data.data. The metadata dictionary is kept under data.metadata, and separate metadata attributes are created from it as well (e.g. data.lat is created from data.metadata['LAT']).

My index timestamps are localized with pytz.utc(). However, when a multi-indexed DataFrame is stored to H5 the timezone localization is lost (using Pandas 0.15.2), so I correct for this in IO().h5load.
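A more compact alternative (just a sketch; it assumes the date level is named 'DATE' and that your pandas version supports set_levels and tz_localize) would rebuild only that level instead of looping over the index tuples:

def relocalize_date_level(df, level='DATE', tz='UTC'):
    # re-attach the timezone to just the datetime level of the MultiIndex
    lev = df.index.names.index(level)
    df.index = df.index.set_levels(df.index.levels[lev].tz_localize(tz), level=level)
    return df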

unu*_*tbu 33

There is an open issue about storing custom metadata on an NDFrame. But because of the many ways pandas functions can return DataFrames, the _metadata attribute is not preserved in all situations.
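For reference, this is roughly the subclassing pattern that _metadata is meant for (a minimal sketch with made-up names; as noted, propagation is not guaranteed across every operation):

import pandas as pd

class TZFrame(pd.DataFrame):
    _metadata = ['local_tz']       # attributes pandas will try to propagate

    @property
    def _constructor(self):
        return TZFrame

tzf = TZFrame({'A': [1, 2, 3]})
tzf.local_tz = 'US/Eastern'
print(tzf.head().local_tz)         # carried through some operations, but not all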

For now, you can simply store the metadata in an auxiliary variable.

There are multiple options for storing DataFrames plus metadata to a file, depending on which format you want to use: pickle, JSON and HDF5 are all possibilities.
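For example, a minimal pickle-based sketch (the function names here are only illustrative) could bundle the DataFrame and its metadata dict into one tuple:

import pickle
import pandas as pd

def pkl_store(filename, df, **metadata):
    # save the DataFrame together with a plain metadata dict
    with open(filename, 'wb') as fh:
        pickle.dump((df, metadata), fh)

def pkl_load(filename):
    with open(filename, 'rb') as fh:
        df, metadata = pickle.load(fh)
    return df, metadata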

Here is how to store and load a DataFrame with metadata using HDF5. The scheme for storing the metadata comes from the Pandas Cookbook.

import numpy as np
import pandas as pd

def h5store(filename, df, **kwargs):
    store = pd.HDFStore(filename)
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = kwargs
    store.close()

def h5load(store):
    data = store['mydata']
    metadata = store.get_storer('mydata').attrs.metadata
    return data, metadata

a = pd.DataFrame(
    data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

filename = '/tmp/data.h5'
metadata = dict(local_tz='US/Eastern')
h5store(filename, a, **metadata)
with pd.HDFStore(filename) as store:
    data, metadata = h5load(store)

print(data)
#     A   B   C   E   D
# 0   9  20  92  43  25
# 1   2  64  54   0  63
# 2  22  42   3  83  81
# 3   3  71  17  64  53
# 4  52  10  41  22  43
# 5  48  85  96  72  88
# 6  10  47   2  10  78
# 7  30  80   3  59  16
# 8  13  52  98  79  65
# 9   6  93  55  40   3
print(metadata)

yields

{'local_tz': 'US/Eastern'}


The*_*Cat 8

The method I use is to add extra MultiIndex levels to store the additional information I want (I use the columns, but either axis would work). All of the columns carry the same value for these extra parameters. This is also useful because I can combine several DataFrames, or split off single columns, and these values are kept.

>>> col=pd.MultiIndex.from_product([['US/Eastern'], ['A', 'B', 'C', 'E', 'D']], names=['local_tz', 'name'])
>>> a = pd.DataFrame(data=pd.np.random.randint(0,100,(10,5)),columns=col)
>>> print(a)
local_tz US/Eastern                
name              A   B   C   E   D
0                38  93  63  24  55
1                21  25  84  98  62
2                 4  60  78   0   5
3                26  50  82  89  23
4                32  70  80  90   1
5                 6  17   8  60  59
6                95  98  69  19  76
7                71  90  45  45  40
8                94  16  44  60  16
9                53   8  30   4  72
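To read the stored value back, or to drop the extra level again before analysis, the column index can be queried directly (a small sketch using the names above):

>>> tz = a.columns.get_level_values('local_tz')[0]       # -> 'US/Eastern'
>>> plain = a.copy()
>>> plain.columns = plain.columns.droplevel('local_tz')   # back to plain A..D columns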


Joh*_*hnE 6

Pandas now has an attribute, attrs, for storing metadata. As stated in the docs it is experimental and may change in the future. There is not much else in the documentation; it is very limited, but also very easy to access and use:

df = pd.DataFrame({'fruit':['apple','pear','banana'], 'price':[3,4,2]})

df.attrs = { 'description':'fruit prices in dollars per pound' }

When you create df, df.attrs is initialized to an empty dictionary. Beyond that, it appears you can store whatever you want in df.attrs as long as the top level is a dictionary. Above I just stored a description of the DataFrame, but another option is to store column labels (see, for example, my answer to How to handle metadata associated with a pandas dataframe?).
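For instance, continuing with the df defined above, column labels can sit right next to the description (the key names here are only illustrative):

df.attrs['labels'] = {'fruit': 'fruit name', 'price': 'price in dollars per pound'}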

Persistence seems limited to only some DataFrame operations. For example, values in attrs are preserved by copy, loc and iloc, but not by groupby. Of course, you can always re-attach them with a simple assignment. For example, if you create a DataFrame called grouped_means via a groupby operation, it will not keep attrs, but you can easily re-attach them:

grouped_means.attrs = df.attrs

Maybe this will happen automatically in a future version of pandas. So: the functionality is very limited, but it is very easy to use and may well improve in the future.