Python Pandas: cannot do slice indexing

Mik*_*cre 3 python multi-index dataframe pandas

I am trying to work with a pandas MultiIndex DataFrame that looks like this:

                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A

I would like to be able to do this:

df.loc[('chr1', slice(3000714, 3001110))]

It fails with the following error:

cannot do slice indexing with these indexers [1204741]

df.index.levels[1].dtype returns dtype('int64'), so slicing with integers should work, shouldn't it?

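Also, any comments on how to do this efficiently would be valuable, because the DataFrame has 12 million rows and I need to run about 70 million slice queries like this one.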

jez*_*ael 6

I think you need to add ,: at the end. It means you slice the rows but take all columns:

print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
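
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the four sample rows from the question and runs the row slice. The construction via set_index and the variable names (rows, by_tuple, by_axis) are my own assumptions, not from the original post. As far as I understand it, a bare tuple passed to .loc tends to be read as (row indexer, column indexer), so the slice ends up applied to the columns unless the column part is spelled out or the axis is fixed.

```python
import pandas as pd

# Rebuild the sample data from the question (assumed construction).
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4,
            "start": [3000714, 3001065, 3001110, 3001131],
            "end": [3000715, 3001066, 3001111, 3001132],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()  # slicing a MultiIndex needs a sorted index
)

# Row indexer: first level equals 'chr1', second level in an inclusive label range.
rows = ("chr1", slice(3000714, 3001110))

# Spell out the column indexer so the tuple is used for the rows only.
by_tuple = df.loc[rows, :]

# Equivalent: tell .loc up front that the key addresses axis 0.
by_axis = df.loc(axis=0)[rows]

print(by_tuple)
```

Both forms should select the same three rows, because label slices in .loc include both endpoints.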

Another solution is to add axis=0 to loc:

print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C

But if you need only 3000714 and 3001110:

print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C

idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start                   
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
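
As a side note, pd.IndexSlice also accepts slice syntax, so the range query can be written without calling slice(). The sketch below rebuilds the sample frame (assumed construction) and compares the list form with the range form. The names picked and ranged are only for illustration.

```python
import pandas as pd

# Sample data from the question (assumed construction).
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4,
            "start": [3000714, 3001065, 3001110, 3001131],
            "end": [3000715, 3001066, 3001111, 3001132],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()
)

idx = pd.IndexSlice

# A list selects exactly the listed start positions.
picked = df.loc[idx["chr1", [3000714, 3001110]], :]

# Colon syntax selects every start position in the inclusive label range.
ranged = df.loc[idx["chr1", 3000714:3001110], :]

print(picked)
print(ranged)
```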

Timings:

In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop

In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop

In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop

In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop
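
With about 70 million queries, calling .loc once per query may add up. One possible alternative, not part of the timings above and not benchmarked here, is to look up all range bounds for a chromosome in one vectorized step with numpy.searchsorted and then slice by position. This is only a sketch under assumptions: the start positions within a chromosome are sorted, and the query arrays (query_lo, query_hi) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed construction).
df = (
    pd.DataFrame(
        {
            "chrom": ["chr1"] * 4,
            "start": [3000714, 3001065, 3001110, 3001131],
            "end": [3000715, 3001066, 3001111, 3001132],
            "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
        }
    )
    .set_index(["chrom", "start"])
    .sort_index()
)

# Handle one chromosome at a time; xs drops the 'chrom' level.
chrom_df = df.xs("chr1", level="chrom")
starts = chrom_df.index.to_numpy()  # sorted start positions

# Hypothetical batch of queries with inclusive bounds, one pair per query.
query_lo = np.array([3000714, 3001000, 3001120])
query_hi = np.array([3001110, 3001070, 3001200])

# Positional bounds for all queries at once.
first = np.searchsorted(starts, query_lo, side="left")   # first row with start >= lo
last = np.searchsorted(starts, query_hi, side="right")   # one past last row with start <= hi

# Number of matching rows per query, without materializing any frames.
counts = last - first

# Materialize the rows only where they are actually needed.
hits = [chrom_df.iloc[a:b] for a, b in zip(first, last)]
```

If only counts or aggregates per query are needed, the first and last arrays may be enough, and the per-query iloc step can be skipped.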