Select rows in pandas MultiIndex DataFrame

cs9*_*s95 78 python slice multi-index dataframe pandas

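Objectives and Motivation

The MultiIndex API has been gaining popularity over the years; however, not everything about it is fully understood in terms of structure, working, and associated operations.

One important operation is filtering. Filtering is a common requirement, but the use cases are diverse. Accordingly, certain methods and functions will be more applicable to some use cases than others.

In summary, the aim of this post is to touch upon some of the common filtering problems and use cases, demonstrate various different methods to solve these problems, and discuss their applicability. Some of the high-level questions this post seeks to address are

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

These problems have been broken down into 7 concrete questions, enumerated below. For simplicity, the example DataFrame in the setup below has only two levels. Note that a few of its index keys, such as ('b', 't') and ('d', 'w'), are repeated. Most of the solutions presented can be generalized to N levels.

This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance-related discussions (these are separate topics for another time).


Questions

Questions 1-6 will be asked in the context of the setup below.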

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item
How do I select rows having "a" in level "one"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level "one" in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value "t" on level "two"?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level
How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

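Question 3: Slicing a Single Cross Section (x, y)
How do I retrieve a cross section, i.e., a single row having specific values for the index, from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by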

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]
How do I select the two rows corresponding to ('c', 'u') and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level
How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

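Question 6: Arbitrary Slicing
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".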

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

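Question 7: Filtering by numeric inequality on individual levels of the MultiIndex
How do I get all rows where values in level "two" are greater than 5?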

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

cs9*_*s95 84

MultiIndex/Advanced Indexing

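Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples were created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) that we will be revisiting frequently

  1. DataFrame.loc - A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs - Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query - Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). It is more applicable to some scenarios than others. Also see the section of the docs on querying MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means the idiom cannot be applied to that problem effectively.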
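
As a quick orientation, here is a minimal sketch that applies each of the Four Idioms to one and the same task (selecting the rows labelled "a" in level "one") on the example frame from the question. The result names `by_loc`, `by_xs`, `by_query` and `by_mask` are only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

# 1. Label-based selection with loc (passing a list keeps both index levels).
by_loc = df.loc[['a']]

# 2. Cross section with xs; drop_level=False keeps level "one" in the result.
by_xs = df.xs('a', level='one', drop_level=False)

# 3. Expression string evaluated by query.
by_query = df.query("one == 'a'")

# 4. Boolean mask built from the values of level "one".
by_mask = df[df.index.get_level_values('one') == 'a']

for result in (by_loc, by_xs, by_query, by_mask):
    print(result['col'].tolist())
```

Each idiom is discussed in detail under the individual questions below.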


Question 1

How do I select rows having "a" in level "one"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general-purpose solution applicable to most situations:

df.loc[['a']]

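At this point, if you get

TypeError: Expected tuple, got str

That means you're using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the level and axis arguments (reasonable defaults can be assumed here).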

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level "one" in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to "ilevel_0 == 'a'".
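
A minimal sketch of the unnamed case, assuming the same data as the question but built without level names (`df_unnamed` is a name introduced only for this illustration). In a query string, `ilevel_0` refers to the first index level and `ilevel_1` to the second.

```python
import numpy as np
import pandas as pd

# Same data as the question, but the index levels carry no names.
unnamed_mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
])
df_unnamed = pd.DataFrame({'col': np.arange(len(unnamed_mux))}, unnamed_mux)

# ilevel_<n> addresses the n-th index level by position.
first_is_a = df_unnamed.query("ilevel_0 == 'a'")
second_is_t = df_unnamed.query("ilevel_1 == 't'")

print(first_is_a)
print(second_is_t)
```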

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level "one" in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])
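
The printed representation of a MultiIndex differs between pandas versions (newer releases show `codes` where the output above shows `labels`, for instance). A version-independent way to check the effect, sketched here with an illustrative variable `trimmed`, is to look at the `levels` attribute directly.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

v = df.loc[['a']]

# Labels that actually occur in the rows of v.
print(v.index.get_level_values('one').unique().tolist())

# remove_unused_levels returns a new MultiIndex; v.index itself is not modified.
trimmed = v.index.remove_unused_levels()
print(trimmed.levels[0].tolist())
print(trimmed.levels[1].tolist())
```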

Question 1b

How do I slice all rows with value "t" on level "two"?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.
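
The three spellings discussed in this note can be compared side by side. This is a small sketch on the example frame; the flag `ambiguous_failed` is introduced here only to record whether the axis-less form raised.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)
idx = pd.IndexSlice

# Row indexer plus an explicit slice over the columns.
with_column_slice = df.loc[idx[:, 't'], :]

# Row indexer only, with the axis stated up front.
with_axis = df.loc(axis=0)[idx[:, 't']]

# With neither hint, 't' is looked up among the columns and is not found.
try:
    df.loc[idx[:, 't']]
    ambiguous_failed = False
except KeyError:
    ambiguous_failed = True

print(with_column_slice.equals(with_axis), ambiguous_failed)
```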

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the level has no name,
# df.query("ilevel_1 == 't'")

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.


Question 2

How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting "b" and "d", you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight that this syntax isn't conventional Python. The pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]

Question 2b

How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjunction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.
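
To make the "one slice per level" rule concrete, here is a small sketch on a hypothetical three-level frame (`deep_mux` and `df_deep` are not part of the question's setup). Selecting on the third level needs two leading colons; selecting on the second level needs one, and nothing has to be written for the levels after it.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level index, used only for this illustration.
deep_mux = pd.MultiIndex.from_product(
    [['a', 'b'], ['x', 'y'], ['t', 'u', 'w']],
    names=['one', 'two', 'three'])
df_deep = pd.DataFrame({'col': np.arange(len(deep_mux))}, index=deep_mux)
idx = pd.IndexSlice

# Target is the third level: slice across the first two levels.
third = df_deep.loc[idx[:, :, ['t', 'w']], :]

# Target is the second level: one leading slice, no trailing one needed.
second = df_deep.loc[idx[:, 'y'], :]

print(third['col'].tolist())
print(second['col'].tolist())
```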

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]

Question 3

How do I retrieve a cross section, i.e., a single row having specific values for the index, from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly, so you can call it to determine whether an additional sorting step is required. (Later pandas releases deprecate is_lexsorted; there, MultiIndex.is_monotonic_increasing serves the same purpose.)
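
A minimal sketch of the check-then-sort pattern. It assumes `MultiIndex.is_monotonic_increasing` as the sortedness check, since `is_lexsorted()` was deprecated in later pandas releases; on the version this answer was tested with, `is_lexsorted()` can be substituted.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

# The example index is not sorted: ('b', 'w') is followed by ('b', 't').
needs_sort = not df.index.is_monotonic_increasing
print(needs_sort)

# Sort once up front, then run the lookups against the sorted frame.
df_sort = df.sort_index() if needs_sort else df
print(df_sort.index.is_monotonic_increasing)

row = df_sort.loc[('c', 'u')]
print(row)
```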

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize, but it is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.
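
If a boolean mask is preferred over a generated query string, one alternative (not one of the approaches shown above for this question) is `MultiIndex.isin` with a list of tuples. A minimal sketch follows; note that a mask keeps the frame's own row order, so the rows come back as ('a', 'w') then ('c', 'u') rather than in the order the keys were listed.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

cses = [('c', 'u'), ('a', 'w')]

# True for every row whose full (one, two) key appears in cses.
mask = df.index.isin(cses)
picked = df[mask]
print(picked)
```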


Question 5

How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect, because it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.
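
The caveat can be seen on the example frame itself, because the key ('b', 't') occurs twice in it. In this sketch the concat-and-deduplicate route loses the second ('b', 't') row, while a boolean mask keeps it (`deduped` and `masked` are names chosen for the illustration).

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)
idx = pd.IndexSlice

# concat + duplicated(): drops every repeated key, wanted or not.
v = pd.concat([df.loc[['a'], :], df.loc[idx[:, 't'], :]])
deduped = v[~v.index.duplicated()]

# Boolean mask: each original row is tested once, so nothing is lost.
m1 = df.index.get_level_values('one') == 'a'
m2 = df.index.get_level_values('two') == 't'
masked = df[m1 | m2]

print(deduped['col'].tolist())
print(masked['col'].tolist())
```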

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

Question 6

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I've added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and their sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
    df.loc[(('a', 'b'), ('u', 'v')), :],
    df.loc[('d', 'w'), :]
], axis=0)

The slicing specification for "a" and "b" is slightly cleaner, (('a', 'b'), ('u', 'v')), because the sub-levels being indexed are the same for each level.
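
Another way to save typing on the explicit key list, sketched here as an assumption rather than as part of the approaches above, is to generate the regular part of the list with `itertools.product` and append the irregular key by hand.

```python
from itertools import product

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

# Every combination of ("a" or "b") with ("u" or "v"), plus the one-off ('d', 'w').
keys = list(product(['a', 'b'], ['u', 'v'])) + [('d', 'w')]
print(keys)

result = df.loc[keys, :]
print(result)
```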


Question 7

How do I get all rows where values in level "two" are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.
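
As an illustration of "any arbitrary condition", this sketch combines a condition on an index level with one on a regular column. To keep it deterministic, the values of level "two" are typed in from the df2 output shown in the question instead of being drawn at random; the extra condition `col < 12` is made up for the example.

```python
import numpy as np
import pandas as pd

# Level "two" values copied from the df2 output in the question.
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    [5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6],
], names=['one', 'two'])
df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

# query can mix index level names and column names in one expression.
by_query = df2.query("two > 5 and col < 12")

# The same filter as an explicit boolean mask.
two = df2.index.get_level_values('two')
by_mask = df2[(two > 5) & (df2['col'].to_numpy() < 12)]

print(by_query)
```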


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the changes you will need to make to the Four Idioms to have them work with columns.

  1. To slice with loc, use

    df3.loc[:, ....] # Notice how we slice across the index with `:`.

    Or,

    df3.loc[:, pd.IndexSlice[...]]

  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly with df3.columns.get_level_values. You will then need to do something like

    df3.loc[:, {condition}]

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

    df3.T.query(...).T

    Not recommended, use one of the other 3 options.
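
Filling in the placeholders above with one concrete choice gives the following sketch on df3. The particular selections (columns "e" and "h" of level "two", and group "A" of level "one") are picked only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
    list('ABCD'), list('efgh')
], names=['one', 'two'])
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
idx = pd.IndexSlice

# 1. loc: all rows, then an IndexSlice over the column levels.
by_loc = df3.loc[:, idx[:, ['e', 'h']]]

# 2. xs: same call as for rows, with axis=1.
by_xs = df3.xs('A', axis=1, level='one', drop_level=False)

# 3. Boolean mask built from the column level values.
two = df3.columns.get_level_values('two')
by_mask = df3.loc[:, two.isin(['e', 'h'])]

# 4. query: transpose, query on the (now row) index, transpose back.
by_query = df3.T.query("two in ['e', 'h']").T

print(by_loc.columns.tolist())
```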


r a*_*r a 14

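Recently I came across a use case where I had a 3+ level multi-index dataframe and could not get any of the solutions above to produce the results I was looking for. It is quite possible that the above solutions do work for my use case, and I tried several, but I was unable to get them to work in the time I had available.

I am far from an expert, but I stumbled across a solution that is not listed in the comprehensive answer above. I offer no guarantee that these solutions are in any way optimal.

This is a different way to get a slightly different result from Question #6 above (and likely other questions as well).

Specifically, I was looking for:

  1. A way to choose two or more values from one level of the index and a single value from another level of the index, and
  2. A way to keep the index values from the previous operation in the dataframe output.

As a monkey wrench in the gears (however totally fixable):

  1. The indexes were unnamed.

On the toy dataframe below: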

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two or more values from one level and a single (or 2+) value from another level:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky, but I found that it filled my needs and, as a bonus, was easier for me to understand and read.
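
For reuse, the mask can be wrapped in a small helper. This is only a sketch: `select` is a hypothetical function name, 'stock9' is a made-up label that does not exist in the index, and `.loc` is used with the boolean mask in place of `.iloc`. It also shows how a missing label behaves: the mask simply selects nothing, while xs raises a KeyError.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'],
                                    ['stock1', 'stock2', 'stock3'],
                                    ['price', 'volume', 'velocity']])
df = pd.DataFrame(list(range(1, 19)), index)


def select(stocks, metrics):
    """Rows whose level-1 label is in `stocks` and level-2 label is in `metrics`."""
    mask = df.index.isin(stocks, level=1) & df.index.isin(metrics, level=2)
    return df.loc[mask]


found = select(['stock1', 'stock3'], ['velocity'])
missing = select(['stock9'], ['velocity'])  # label absent: empty result, no error

try:
    df.xs(('stock9', 'velocity'), level=(1, 2))
    xs_raised = False
except KeyError:
    xs_raised = True

print(found)
print(missing.empty, xs_raised)
```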

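  • Nice, didn't know about the "level" argument of "Index.isin"! (3 upvotes)
  • The "xs" method will also raise an error if nothing is found, unlike "isin", which returns an empty list. (2 upvotes)
  • Using IndexSlice avoids parsing the index repeatedly. You can use `df.loc[pd.IndexSlice[:, 'stock1', 'velocity'], :]` and `df.loc[pd.IndexSlice[:, ['stock1', 'stock3'], 'velocity'], :]`, as cs95 has already demonstrated. (2 upvotes)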