NumPy arrays: efficiently finding matching indices

use*_*855 6 python numpy scipy

I have two lists: one is huge (millions of elements) and the other has only a few elements. I want to do the following:

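import numpy as np

bigArray = np.array([0, 1, 0, 2, 3, 2, ...])  # millions of elements
smallArray = np.array([0, 1, 2, 3, 4])

for i in range(len(smallArray)):
    # indices of bigArray whose value equals smallArray[i]
    pts = np.where(bigArray == smallArray[i])
    # Do stuff with pts...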

The above works, but it is slow. Is there a way to do this more efficiently without resorting to writing something in C?

seg*_*sai 8

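In your case, you may benefit from presorting your big array. The example below shows how to reduce the time from about 45 seconds to 2 seconds (on my laptop, for one particular pair of array lengths, 5e6 vs. 1e3). Obviously, this solution will not be optimal if the array sizes are very different. For example, the default solution has complexity O(bigN*smallN), whereas my suggested solution is O((bigN + smallN)*log(bigN)): sorting costs O(bigN*log(bigN)), and each of the smallN lookups is a binary search costing O(log(bigN)).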

import bisect
import time

import numpy as np
import numpy.random as nprand

bigN = int(5e6)   # sizes and bounds must be integers
smallN = 1000
maxn = int(1e7)
nprand.seed(1)
bigArr = nprand.randint(0, maxn, size=bigN)
smallArr = nprand.randint(0, maxn, size=smallN)

# brute force: one full scan of bigArr per element of smallArr
t1 = time.time()
for i in range(len(smallArr)):
    inds = np.where(bigArr == smallArr[i])[0]
t2 = time.time()
print("Brute", t2 - t1)

# not brute force (like nested loop with index scan):
# sort once, then binary-search each value in the sorted copy
t1 = time.time()
sortedind = np.argsort(bigArr)
sortedbigArr = bigArr[sortedind]
for i in range(len(smallArr)):
    i1 = bisect.bisect_left(sortedbigArr, smallArr[i])
    i2 = bisect.bisect_right(sortedbigArr, smallArr[i])
    inds = sortedind[i1:i2]  # original positions of the matches
t2 = time.time()
print("Non-brute", t2 - t1)

Output:

Brute 42.5278530121

Non-brute 1.57193303108
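To make the presorted lookup concrete, here is a minimal sketch on the six-element array from the question. The names `order`, `sorted_big`, and `matches` are illustrative, and `kind="stable"` is an added assumption (the answer above uses the default sort) so that the indices for each value come out in ascending order.

```python
import bisect

import numpy as np

big = np.array([0, 1, 0, 2, 3, 2])
small = np.array([0, 1, 2, 3, 4])

# Sort once; `order` maps positions in the sorted copy back to `big`.
order = np.argsort(big, kind="stable")
sorted_big = big[order]

matches = {}
for value in small:
    # Equal values form one contiguous run [lo, hi) in the sorted copy.
    lo = bisect.bisect_left(sorted_big, value)
    hi = bisect.bisect_right(sorted_big, value)
    matches[int(value)] = order[lo:hi].tolist()

print(matches)
# {0: [0, 2], 1: [1], 2: [3, 5], 3: [4], 4: []}
```

A value that does not occur in the big array (4 here) simply yields an empty slice, so no special case is needed.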

  • I'm not entirely sure, but there may be room for optimization by using `np.searchsorted` instead of the loop with bisect. (3 upvotes)

小智 7

NumPy provides the numpy.searchsorted function: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.searchsorted.html

Example:

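>>> import numpy as np
>>> # indices that would sort big_list (named `sorter` to avoid shadowing the built-in sorted)
>>> sorter = np.argsort(big_list)
>>> # for each value in small_list, the bounds of its run in sorted order
>>> r = np.searchsorted(big_list, small_list, side='right', sorter=sorter)
>>> l = np.searchsorted(big_list, small_list, side='left', sorter=sorter)
>>> for b, e in zip(l, r):
...     inds = sorter[b:e]  # indices into big_list of the matching elements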
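As a sanity check, the `searchsorted` approach can be wrapped in a small helper and compared against the brute-force `np.where` scan on random data. This is only a sketch: the function name `matching_indices`, the test sizes, and the use of `kind="stable"` (which makes the index order match `np.where`) are assumptions, not part of the answer above.

```python
import numpy as np


def matching_indices(big, small):
    """Return one index array per element of `small` (hypothetical helper)."""
    big = np.asarray(big)
    small = np.asarray(small)
    # A stable sort keeps equal values in their original (ascending) order.
    order = np.argsort(big, kind="stable")
    left = np.searchsorted(big, small, side="left", sorter=order)
    right = np.searchsorted(big, small, side="right", sorter=order)
    return [order[b:e] for b, e in zip(left, right)]


rng = np.random.default_rng(0)
big = rng.integers(0, 50, size=1000)
small = np.arange(0, 60, 7)  # includes 56, which cannot occur in `big`

result = matching_indices(big, small)
for value, inds in zip(small, result):
    # Each result should equal what the brute-force scan finds.
    assert np.array_equal(inds, np.where(big == value)[0])
```

Both `searchsorted` calls are vectorized over `small`, so the only Python-level loop left is the cheap slicing at the end.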