Ven*_*tta 4 python numpy vectorization
我有一个时间序列A.我想生成另一个时间序列B,这样
B [i] = j,其中j是大于i的第一个索引,使得A [j]> A [i].
在numpy中有这么快的方法吗?
谢谢.
[已编辑]:最好只使用O(n)空间.
测试不充分,请自担风险.
import numpy
a = numpy.random.random(100)
# a_by_a[i,j] = a[i] > a[j]
a_by_a = a[numpy.newaxis,:] > a[:,numpy.newaxis]
# by taking the upper triangular, we ignore all cases where i < j
a_by_a = numpy.triu(a_by_a)
# argmax will give the first index with the highest value (1 in this case)
print numpy.argmax(a_by_a, axis = 1)
Run Code Online (Sandbox Code Playgroud)
内存版本较低:
a = numpy.random.random(100)
@numpy.vectorize
def values(i):
try:
return (a[i:] > a[i]).nonzero()[0][0] + i
except IndexError:
return -1 # no valid values found
b = values(numpy.arange(100))
Run Code Online (Sandbox Code Playgroud)
版本更快:
@np.vectorize
def values(i):
try:
return next(idx for idx, value in enumerate(A[i+1:]) if value > A[i]) + i + 1
except StopIteration:
return -1 # no valid values found
return values(np.arange(len(A)))
Run Code Online (Sandbox Code Playgroud)
更快的版本:
def future6(A):
# list of tuples (index into A, value in A)
# invariant: indexes and values in sorted order
known = []
result = []
for idx in xrange(len(A) - 1, -1, -1):
value = A[idx]
# since known is sorted a binary search could be applied here
# I haven't bothered
# anything lower then the current value
# cannot possibly be used again, since this value will be first index instead
# of those
known = [(x,y) for x,y in known if y > value]
if known:
# all values in known are > current value
# they reverse sorted by index
# the value with the lowest index is first
result.append(known[-1][0])
else:
# no values exist this high, report -1
result.append(-1)
# add to end of the list to maintain invariant
known.append( (idx, value) )
# let numpy worry about reversing the array
return np.array(result)[::-1]
Run Code Online (Sandbox Code Playgroud)
感谢cyborg这里使用的一些想法.
算法差异
cyborg显示各种算法之间存在显着差异,具体取决于输入数据的数据.我从运行的算法中收集了一些统计信息,看看我是否能弄清楚发生了什么.
随机数据:
Average distance between value and its target: 9
Average length of ascends list: 24
Average length of segment in ascends list: 1.33
Average length of known list: 9.1
Run Code Online (Sandbox Code Playgroud)
由于列表非常短,因此上升算法主要衰减为线性搜索.它确实清除了将来无法使用的上升,因此它仍然明显优于线性搜索.
摆动:
Average distance between value and its target: 31.46
Average length of ascends list: 84
Average length of segment in ascends list: 1.70
Average length of known list: 57.98
Run Code Online (Sandbox Code Playgroud)
振荡倾向于使不同的部分进一步分开.这自然妨碍了线性搜索算法.两种"更智能"的算法都必须跟踪其他数据.我的算法大大衰减,因为它每次都会扫描数据.上升算法触及较少的数据并且做得更好.
上升:
Average distance between value and its target: 2.57
Average length of ascends list: 40
Average length of segment in ascends list: 3.27
Average length of known list: 3037.97
Run Code Online (Sandbox Code Playgroud)
很清楚为什么我的算法有问题,它必须跟踪大量的提升值.值和目标之间的短距离解释了线性搜索的良好性能.Ascends仍然没有与很长的细分市场合作.
更好的算法
我的算法没有理由对数据进行线性搜索.它已排序,我们只需要从列表末尾删除小值.
def future6(A):
# list of tuples (index into A, value in A)
# invariant: indexes and values in sorted order
known = []
result = []
for idx in xrange(len(A) - 1, -1, -1):
value = A[idx]
# since known is sorted a binary search could be applied here
# I haven't bothered
# anything lower then the current value
# cannot possibly be used again, since this value will be first index instead
# of those
while known and known[-1][1] < value:
known.pop()
if known:
# all values in known are > current value
# they reverse sorted by index
# the value with the lowest index is first
result.append(known[-1][0])
else:
# no values exist this high, report -1
result.append(-1)
# add to end of the list to maintain invariant
known.append( (idx, value) )
# let numpy worry about reversing the array
return np.array(result)[::-1]
Run Code Online (Sandbox Code Playgroud)
但是我发现我们可以重用以前的B计算值而不是构建新的数据结构.如果j> i,A [i]> A [j]则B [i]> B [j].
def future8(A):
B = [-1] * len(A)
for index in xrange(len(A)-2, -1, -1):
target = index + 1
value = A[index]
while target != -1 and A[target] < value:
target = B[target]
B[index] = target
return np.array(B)
Run Code Online (Sandbox Code Playgroud)
我的基准结果:
Random series:
future2 ascends : 0.242569923401
future6 full list: 0.0363488197327
future7 vectorize: 0.129994153976
future8 reuse: 0.0299410820007
Oscillating series:
future2 ascends : 0.233623981476
future6 full list: 0.0360488891602
future7 vectorize: 1.19140791893
future8 reuse: 0.0297570228577
Ascending trend series:
future2 ascends : 0.120707035065
future6 full list: 0.0314049720764
future7 vectorize: 0.0640320777893
future8 reuse: 0.0246520042419
Run Code Online (Sandbox Code Playgroud)
升序部分
Cyborg有一个非常有趣的想法,可以利用提升的细分市场.我不认为他的任何测试案例真的表现出他在那之后的行为.我不认为这些部分足够长,可以利用.但我认为真实数据可能有这样的部分,所以利用它将是非常有帮助的.
但我认为它不会起作用.准备必要的数据进行二进制搜索需要花费O(n)的时间.如果我们多次进行二进制搜索,那就没问题,但是一旦我们在升序部分的中间找到一个值,我们就再也不会重新访问左边的任何内容了.因此,即使使用二进制搜索,我们也花费最多O(n)时间来处理数据.
如果构建必要数据的成本较低,那么稍后扫描升序部分就可以了.但扫描非常便宜,你很难想出一种方法来处理更便宜的升序部分.
| 归档时间: |
|
| 查看次数: |
1086 次 |
| 最近记录: |