Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes

Question


Ton*_*ony 5 python pandas hdfstore

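In Pandas, is there a way to efficiently pull out all the MultiIndex indexes present in an HDFStore in table format?

I can select() efficiently using where=, but I want all indexes and none of the columns. I can also select() with iterator=True to save RAM, but that still means reading nearly the whole table from disk, so it is still slow.

I have been hunting around in the store.root..table.* stuff, hoping that I can get a list of index values. Am I on the right track?

Plan B would be to keep a shorter MultiIndex DataFrame that just contains empty DataFrames, appended every time I append to the main one. I can retrieve that and get the index far more cheaply than from the main one. Inelegant, though.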

Answer 1

Jef*_*eff 6

Create a multi-index df

In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))

In [36]: df['one'] = 'foo'

In [37]: df['two'] = 'bar'

In [38]: df.loc[50000:,'two'] = 'bah'

In [40]: mi = df.set_index(['one','two'])

In [41]: mi
Out[41]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A    100000  non-null values
B    100000  non-null values
C    100000  non-null values
dtypes: float64(3)


Store it as a table

In [42]: store = pd.HDFStore('test.h5',mode='w')

In [43]: store.append('df',mi)


get_storer returns the stored object (but does not retrieve the data)

In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])
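
As a rough sketch of what the storer exposes without reading any rows, the script below writes a small MultiIndex frame and inspects only metadata. It assumes a recent pandas with PyTables installed; the file name `meta.h5` and the variable names are made up for illustration, and the `nrows` and `data_columns` attributes are used as found on table-format storers rather than as something the answer itself relies on.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small MultiIndex frame shaped like the one in the answer.
df = pd.DataFrame(np.random.randn(100, 3), columns=list("ABC"))
df["one"] = "foo"
df["two"] = "bar"
mi = df.set_index(["one", "two"])

# Hypothetical scratch file; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "meta.h5")
with pd.HDFStore(path, mode="w") as store:
    store.append("df", mi)             # table format, so it can be queried
    storer = store.get_storer("df")    # metadata handle, no data is loaded
    level_names = list(storer.levels)  # names of the MultiIndex levels
    n_rows = storer.nrows              # row count recorded by the table
    data_columns = list(storer.data_columns)

print(level_names, n_rows, data_columns)
```

The index level names should also show up among the data columns, which is what makes them individually readable.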


The index levels are created as data_columns, which means you can use them in selections. Here is how to select only the index:

In [48]: store.select('df',columns=['one'])
Out[48]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame


Select a single column and return it as a MultiIndex frame

In [49]: store.select('df',columns=['A'])
Out[49]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A    100000  non-null values
dtypes: float64(1)


To select a single column as a Series (this can also be an index level, since the levels are stored as columns). This will be quite fast.

In [2]: store.select_column('df','one')
Out[2]: 
0     foo
1     foo
2     foo
3     foo
4     foo
5     foo
6     foo
7     foo
8     foo
9     foo
10    foo
11    foo
12    foo
13    foo
14    foo
...
99985    foo
99986    foo
99987    foo
99988    foo
99989    foo
99990    foo
99991    foo
99992    foo
99993    foo
99994    foo
99995    foo
99996    foo
99997    foo
99998    foo
99999    foo
Length: 100000, dtype: object


If you really want the fastest way to select only the index:

In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop

In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop


Or get the complete index

In [6]: def f():
   ...:     level_1 =  store.select_column('df','one')
   ...:     level_2 =  store.select_column('df','two')
   ...:     return MultiIndex.from_arrays([ level_1, level_2 ])
   ...: 

In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop
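
For readers on a current pandas, the same idea as f() can be written as a standalone script. This is only a sketch: the helper name `read_full_index`, the file name and the reduced row count are illustrative choices, and it assumes PyTables is installed. It reads the level names from the storer instead of hard-coding them.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_full_index(store, key):
    """Rebuild the MultiIndex of a table-format node from its level columns only."""
    levels = list(store.get_storer(key).levels)  # e.g. ['one', 'two']
    arrays = [store.select_column(key, name) for name in levels]
    return pd.MultiIndex.from_arrays(arrays, names=levels)


# Same construction as in the answer, with fewer rows to keep it quick.
df = pd.DataFrame(np.random.randn(1000, 3), columns=list("ABC"))
df["one"] = "foo"
df["two"] = "bar"
df.loc[500:, "two"] = "bah"
mi = df.set_index(["one", "two"])

path = os.path.join(tempfile.mkdtemp(), "test.h5")  # hypothetical scratch file
with pd.HDFStore(path, mode="w") as store:
    store.append("df", mi)
    full_index = read_full_index(store, "df")

print(full_index[:3])
```

Only the two level columns are read from disk here; columns A, B and C are never loaded.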


If you want the values of each level, this is a very fast way to do it

In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)

In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)
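
If even a single level column is too large to hold in memory at once, the unique values can be collected in slices. The sketch below assumes the `start` and `stop` arguments of `select_column` in recent pandas versions; the helper name `unique_level_values` and the chunk size are hypothetical and not part of the answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def unique_level_values(store, key, level, chunksize=100_000):
    """Collect the distinct values of one index level, reading it slice by slice."""
    nrows = store.get_storer(key).nrows
    seen = set()
    for start in range(0, nrows, chunksize):
        stop = min(start + chunksize, nrows)
        chunk = store.select_column(key, level, start=start, stop=stop)
        seen.update(chunk.unique())
    return sorted(seen)


df = pd.DataFrame(np.random.randn(1000, 3), columns=list("ABC"))
df["one"] = "foo"
df["two"] = "bar"
df.loc[500:, "two"] = "bah"
mi = df.set_index(["one", "two"])

path = os.path.join(tempfile.mkdtemp(), "chunks.h5")  # hypothetical scratch file
with pd.HDFStore(path, mode="w") as store:
    store.append("df", mi)
    values_one = unique_level_values(store, "df", "one", chunksize=300)
    values_two = unique_level_values(store, "df", "two", chunksize=300)

print(values_one, values_two)
```

This trades a little speed for a bounded memory footprint, since each call only materializes `chunksize` values of one column.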

