Finding matching intervals in a pandas IntervalIndex

cs9*_*s95 12 python intervals pandas

pandas 0.20 introduced an interesting new API called IntervalIndex, which lets you create an index of intervals.

Given some sample data:

data = [(893.1516130000001, 903.9187099999999),
 (882.384516, 893.1516130000001),
 (817.781935, 828.549032)]

You can create the index like this:

import pandas as pd

idx = pd.IntervalIndex.from_tuples(data)

print(idx)
IntervalIndex([(893.151613, 903.91871], (882.384516, 893.151613], (817.781935, 828.549032]]
              closed='right',
              dtype='interval[float64]')

An interesting property of an Interval is that you can test membership with in:

print(idx[-1])
Interval(817.78193499999998, 828.54903200000001, closed='right')

print(820 in idx[-1])
True

print(1000 in idx[-1])
False

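I would like to know how to apply this operation to the whole index. For example, given some number such as 900, how can I retrieve a boolean mask of the intervals that contain it?

The best I can think of is: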

m = [900 in y for y in idx]
print(m)
[True, False, False]

Is there a better way to do this?

Jef*_*eff 17

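If you are interested in performance, an IntervalIndex is optimized for searching. Both .get_loc and .get_indexer use an internally built IntervalTree (similar to a binary tree), which is constructed on first use.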

In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)

In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# construct tree and search
In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# subsequently
In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# for a single indexer you can do even better (note that this is
# dipping into the implementation a bit)
In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that .get_loc() returns an indexer. An indexer is actually more useful than a boolean array, and the two can be converted into each other.

In [38]: idx.map(lambda x: 900 in x)
    ...: 
Out[38]: 
Index([ True, False, False,  True, False, False,  True, False, False,  True,
       ...
       False,  True, False, False,  True, False, False,  True, False, False], dtype='object', length=30000)

In [39]: idx.get_loc(900)
    ...: 
Out[39]: array([29997,  9987, 10008, ..., 19992, 19989,     0])

The boolean array can be converted to an array of indexers:

In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
Out[5]: array([    0,     3,     6, ..., 29991, 29994, 29997])

This is what .get_loc() and .get_indexer() return:

In [6]: np.sort(idx.get_loc(900))
Out[6]: array([    0,     3,     6, ..., 29991, 29994, 29997])
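To make the conversion between a boolean mask and a positional indexer concrete, here is a minimal sketch on the three-interval sample from the question. It assumes a reasonably recent pandas in which .get_loc() returns a single integer for a unique match, and it relies on the sample intervals not overlapping, which .get_indexer() requires. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
idx = pd.IntervalIndex.from_tuples(data)  # closed='right' by default

# Boolean mask built from the public left/right bounds.
mask = np.asarray((idx.left < 900) & (900 <= idx.right))

# Mask -> positional indexer.
positions = np.flatnonzero(mask)

# Positional indexer -> mask.
round_trip = np.zeros(len(idx), dtype=bool)
round_trip[positions] = True

# Position of the interval containing 900 (assumes exactly one match).
loc = idx.get_loc(900)

# get_indexer maps each target to the position of the interval that
# contains it, or -1 when no interval does.
targets = idx.get_indexer([900.0, 820.0, 1000.0])

print(positions, loc, targets)
```

With overlapping or duplicated intervals, such as the data*10000 index timed above, .get_indexer() is not applicable and the shape of the .get_loc() result depends on the pandas version, so the mask-based route is the more portable one.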


Flo*_*oor 5

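If you are looking for speed, you can use the left and right attributes of idx, i.e. take the lower and upper bound of each interval and check whether the number falls between them: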

# idx is closed on the right, so the lower bound is exclusive
list(lower < 900 <= upper for (lower, upper) in zip(idx.left, idx.right))

Or:

mask = (900 > idx.left) & (900 <= idx.right)
print(mask.tolist())
[True, False, False]
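The bounds comparison only matches the `in` check when it respects which side of the interval is closed. The sketch below wraps that logic in a hypothetical helper, contains_mask (not part of pandas), that reads idx.closed and picks strict or non-strict comparisons accordingly. Recent pandas versions also provide IntervalIndex.contains, which may be preferable where it is available.

```python
import numpy as np
import pandas as pd

def contains_mask(idx, value):
    """Boolean mask of the intervals in idx that contain value.

    Hypothetical helper for illustration; not a pandas API.
    """
    left = np.asarray(idx.left)
    right = np.asarray(idx.right)
    # idx.closed is one of 'left', 'right', 'both', 'neither'.
    lower_ok = left <= value if idx.closed in ("left", "both") else left < value
    upper_ok = value <= right if idx.closed in ("right", "both") else value < right
    return lower_ok & upper_ok

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
idx = pd.IntervalIndex.from_tuples(data)

mask = contains_mask(idx, 900)

# Same answer as the element-wise `in` check from the question.
same_as_in = mask.tolist() == [900 in iv for iv in idx]

# A value exactly on a shared endpoint belongs only to the interval
# whose closed side touches it.
edge = contains_mask(idx, 893.1516130000001)

print(mask, same_as_in, edge)
```

The endpoint case is where a plain lower <= value <= upper test would differ: it would report a match for both neighbouring intervals.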

For small data:

%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
100000 loops, best of 3: 11.26 µs per loop

%%timeit
[900 in y for y in idx]
100000 loops, best of 3: 9.26 µs per loop

For large data:

idx = pd.IntervalIndex.from_tuples(data*10000)

%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
10 loops, best of 3: 29.2 ms per loop

%%timeit
[900 in y for y in idx]
10 loops, best of 3: 64.6 ms per loop

For large data, this approach beats your solution.
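For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The function names are made up for this example, the right-closed bounds are assumed, and the timings will vary by machine and pandas version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
big = pd.IntervalIndex.from_tuples(data * 10000)

def loop_in():
    # Element-wise membership test, as in the question.
    return [900 in iv for iv in big]

def zip_bounds():
    # Pure-Python comparison over the left/right bounds.
    return [lo < 900 <= hi for lo, hi in zip(big.left, big.right)]

def vectorized():
    # NumPy comparison over the whole index at once.
    return (np.asarray(big.left) < 900) & (900 <= np.asarray(big.right))

for fn in (loop_in, zip_bounds, vectorized):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

All three functions should produce the same mask, so the only thing being compared is how the work is done.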
