How does pandas IndexSlice work

Question

I am following this tutorial: GitHub link

If you scroll down to the section titled "Exercise: Select the most-reviewd beers" (Ctrl + F for that title):

The DataFrame is multi-indexed:

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

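My question is about how IndexSlice is used here: why can the colon after top_beers be left out while the code still runs? I expected to have to write: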

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']]

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work, even though it does not say what to do with the time level?

Answer 1

nor*_*ius 6

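To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

There is not much to say about its implementation. As you can read in the source code, it does just the following: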

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

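From this we see that pd.IndexSlice merely forwards the argument that __getitem__ has received. Looks pretty silly, doesn't it? However, it actually does something useful.

As you know, obj.__getitem__(arg) is called when you access an object obj through its bracket operator, obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Instead, we use the colon slice notation : inside the brackets, e.g. obj[0:5].

And here is the point. The Python interpreter translates this : notation into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() is a slice, a plain value such as an integer (if no : was used), or a tuple of these (if multiple arguments were passed). In short, the only purpose of IndexSlice is that we do not have to construct the slices ourselves. This behavior is particularly useful for pd.DataFrame.loc.

Let's first look at the following examples: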

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

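So every colon : is converted into the corresponding slice object. If multiple arguments are passed to the index operator, they are returned as an n-tuple.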
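Nothing in this mechanism is specific to pandas. As a minimal sketch, the class below (PassThrough is a hypothetical name, not part of pandas) reproduces the pass-through behaviour and compares its output with pd.IndexSlice and with hand-built slice objects:

```python
import pandas as pd


class PassThrough:
    """Hypothetical stand-in: returns whatever the brackets receive."""

    def __getitem__(self, arg):
        return arg


my_idx = PassThrough()  # used as an instance, just like pd.IndexSlice

# The interpreter builds the slice objects before __getitem__ is called.
single = my_idx[0:3]
multi = my_idx[0:3, 'a':'c']

# Both spellings below describe the same tuple of slices.
same_as_pandas = multi == pd.IndexSlice[0:3, 'a':'c']
same_as_manual = multi == (slice(0, 3), slice('a', 'c'))
print(single, multi, same_as_pandas, same_as_manual)
```

The comparison works because slice objects with equal start, stop and step compare equal.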

To demonstrate how this is useful for a pandas DataFrame df with a multi-level index, let's look at the following.

# Let's first construct a table with a three-level
# row-index, and single-level column index.
import numpy as np
import pandas as pd
idx = pd.IndexSlice         # same shorthand as in the previous example
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Return 'col1', select all rows.
df.loc[:,'col1']            # pd.Series         

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can 
# enforce the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # 

# Select all rows with top-level values 0 to 3
# (label-based slices include both end points).
df.loc[0:3, 'col1']   

# If we want to create a slice for multiple index levels
# we need to pass somehow a list of slices. The following
# however leads to a SyntaxError because the slice 
# operator ':' cannot be placed inside a list declaration.
# df.loc[[0:3, 'a':'c'], 'col1']   # SyntaxError, therefore commented out

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can also expand the slice specification by third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

So, in summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.

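What pandas then does with these slice specifications is a different story. Essentially, it selects rows/columns starting from the top-most index level and narrows the selection as it moves down to lower levels, depending on how many levels have been specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all of this.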
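The narrowing can be made visible by counting rows. The sketch below rebuilds a frame with the same three-level index as in the example above (the values are arbitrary, chosen only for illustration) and specifies one, two and three levels in turn:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# 10 x 6 x 4 = 240 rows, sorted on all three levels.
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']]
)
df = pd.DataFrame(
    np.arange(len(mi) * 2).reshape(len(mi), 2),
    index=mi,
    columns=['col1', 'col2'],
)

# Each additional level in the specification narrows the selection further.
one_level = df.loc[idx[0:3, :], :]                       # 4 * 6 * 4 rows
two_levels = df.loc[idx[0:3, 'a':'c'], :]                # 4 * 3 * 4 rows
three_levels = df.loc[idx[0:3, 'a':'c', 'I':'III'], :]   # 4 * 3 * 3 rows

print(len(df), len(one_level), len(two_levels), len(three_levels))
```

Label slices include both end points, which is why 0:3 covers four top-level values and 'I':'III' covers three.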

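As you already pointed out in a comment, pandas seems to behave strangely in some special cases. The two examples you mentioned actually evaluate to the same result. However, pandas treats them differently internally.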

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

Admittedly, the difference is subtle.
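The hashability remark in the comments above can be checked in isolation. This sketch only tests whether hash() accepts the objects. It does not reproduce the .loc calls, whose behaviour may differ between pandas versions, and the names top and can_hash are made up for the illustration:

```python
import pandas as pd

# Hypothetical stand-in for top_reviewers.
top = pd.Index(['ann', 'bob'])


def can_hash(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


index_hashable = can_hash(top)          # a pd.Index refuses to be hashed
tuple_hashable = can_hash(tuple(top))   # a tuple of strings can be hashed
print(index_hashable, tuple_hashable)
```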

Answer 2

Tom*_*ger 5

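Pandas only requires you to specify enough levels of the MultiIndex to remove any ambiguity. Since you are slicing on the second level, you need the first : to say "I am not filtering on this level".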

Any additional levels that are not specified are returned in full, which is equivalent to a : on each of those levels.
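This can be checked on a small frame shaped like the one in the question. The data below is invented for illustration and only the index level names follow the question. Both spellings should select the same rows:

```python
import pandas as pd

idx = pd.IndexSlice

# Made-up miniature of the reviews data (assumption, not the tutorial's data).
raw = pd.DataFrame({
    'profile_name': ['ann', 'ann', 'bob', 'bob', 'cat', 'cat'],
    'beer_id': [1, 2, 1, 3, 1, 2],
    'time': pd.to_datetime([
        '2020-01-01', '2020-01-02', '2020-01-03',
        '2020-01-04', '2020-01-05', '2020-01-06',
    ]),
    'beer_name': ['Ale', 'Bock', 'Ale', 'Cider', 'Ale', 'Bock'],
    'beer_style': ['ale', 'lager', 'ale', 'cider', 'ale', 'lager'],
})

# The two most-reviewed beers: ids 1 and 2.
top_beers = raw['beer_id'].value_counts().head(2).index

# Three-level index, sorted so that slicing on it is well defined.
reviews = raw.set_index(['profile_name', 'beer_id', 'time']).sort_index()

cols = ['beer_name', 'beer_style']
short = reviews.loc[idx[:, top_beers], cols]      # time level left out
full = reviews.loc[idx[:, top_beers, :], cols]    # time level spelled out

print(short.equals(full), len(short))
```

The leading : cannot be dropped in the same way, because without it top_beers would be matched against profile_name, the first level.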

Archived:	8 years, 6 months ago
Views:	3733
Last activity:	6 years, 8 months ago