Ric*_*son 16 python numpy scipy python-2.7 pandas
我想找出在Python中实现以下目标的最有效方法:
假设我们有两个列表a,b它们长度相等,最多包含1e7个元素.但是,为了便于说明,我们可能会考虑以下因素:
a = [2, 1, 2, 3, 4, 5, 4, 6, 5, 7, 8, 9, 8,10,11]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15]
Run Code Online (Sandbox Code Playgroud)
目标是创建严格单调的列表a_new,a而仅使用具有相同值的样本点的第一个样本点.a还应删除必须删除的相同索引,以便b最终结果为:
a_new = [2, 3, 4, 5, 6, 7, 8, 9,10,11]
b_new = [1, 4, 5, 6, 8,10,11,12,14,15]
Run Code Online (Sandbox Code Playgroud)
当然,这可以使用计算上昂贵的for循环来完成,然而由于大量数据,这不适合.
任何建议都非常感谢.
Psi*_*dom 14
您可以计算累计最大值a然后删除重复项np.unique,您还可以使用它来记录唯一索引以便b相应地进行子集化:
a = np.array([2,1,2,3,4,5,4,6,5,7,8,9,8,10,11])
b = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
a_cummax = np.maximum.accumulate(a)
a_new, idx = np.unique(a_cummax, return_index=True)
a_new
# array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
b[idx]
# array([ 1, 4, 5, 6, 8, 10, 11, 12, 14, 15])
Run Code Online (Sandbox Code Playgroud)
piR*_*red 11
运行@ juanpa.arrivillaga的函数版本 numba
import numba
def psi(A):
a_cummax = np.maximum.accumulate(A)
a_new, idx = np.unique(a_cummax, return_index=True)
return idx
def foo(arr):
aux=np.maximum.accumulate(arr)
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
return np.nonzero(flag)[0]
@numba.jit
def f(A):
m = A[0]
a_new, idx = [m], [0]
for i, a in enumerate(A[1:], 1):
if a > m:
m = a
a_new.append(a)
idx.append(i)
return idx
Run Code Online (Sandbox Code Playgroud)
定时
%timeit f(a)
The slowest run took 5.37 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.83 µs per loop
%timeit foo(a)
The slowest run took 9.41 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.35 µs per loop
%timeit psi(a)
The slowest run took 9.66 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.95 µs per loop
Run Code Online (Sandbox Code Playgroud)
jua*_*aga 10
这是一个vanilla Python解决方案,只需一次通过:
>>> a = [2,1,2,3,4,5,4,6,5,7,8,9,8,10,11]
>>> b = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
>>> a_new, b_new = [], []
>>> last = float('-inf')
>>> for x, y in zip(a, b):
... if x > last:
... last = x
... a_new.append(x)
... b_new.append(y)
...
>>> a_new
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> b_new
[1, 4, 5, 6, 8, 10, 11, 12, 14, 15]
Run Code Online (Sandbox Code Playgroud)
我很想知道它与numpy解决方案的比较,它将具有相似的时间复杂度,但会对数据进行一些传递.
这是一些时间安排.首先,设置:
>>> small = ([2,1,2,3,4,5,4,6,5,7,8,9,8,10,11], [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
>>> medium = (np.random.randint(1, 10000, (10000,)), np.random.randint(1, 10000, (10000,)))
>>> large = (np.random.randint(1, 10000000, (10000000,)), np.random.randint(1, 10000000, (10000000,)))
Run Code Online (Sandbox Code Playgroud)
现在有两种方法:
>>> def monotonic(a, b):
... a_new, b_new = [], []
... last = float('-inf')
... for x,y in zip(a,b):
... if x > last:
... last = x
... a_new.append(x)
... b_new.append(y)
... return a_new, b_new
...
>>> def np_monotonic(a, b):
... a_new, idx = np.unique(np.maximum.accumulate(a), return_index=True)
... return a_new, b[idx]
...
Run Code Online (Sandbox Code Playgroud)
注意,这些方法并不严格等同,一个停留在香草Python土地上,另一个停留在numpy阵列土地上.假设您从相应的数据结构(numpy.array或者list)开始,我们将比较性能:
首先,一个小的列表,与OP的示例相同,我们看到numpy实际上并不快,这对于小型数据结构来说并不奇怪:
>>> timeit.timeit("monotonic(a,b)", "from __main__ import monotonic, small; a, b = small", number=10000)
0.039130652003223076
>>> timeit.timeit("np_monotonic(a,b)", "from __main__ import np_monotonic, small, np; a, b = np.array(small[0]), np.array(small[1])", number=10000)
0.10779813499539159
Run Code Online (Sandbox Code Playgroud)
现在是10,000个元素的"中等"列表/数组,我们开始看到numpy优势:
>>> timeit.timeit("monotonic(a,b)", "from __main__ import monotonic, medium; a, b = medium[0].tolist(), medium[1].tolist()", number=10000)
4.642718859016895
>>> timeit.timeit("np_monotonic(a,b)", "from __main__ import np_monotonic, medium; a, b = medium", number=10000)
1.3776302759943064
Run Code Online (Sandbox Code Playgroud)
现在,有趣的是,优势似乎缩小了"大"数组,大约为1e7元素:
>>> timeit.timeit("monotonic(a,b)", "from __main__ import monotonic, large; a, b = large[0].tolist(), large[1].tolist()", number=10)
4.400254560023313
>>> timeit.timeit("np_monotonic(a,b)", "from __main__ import np_monotonic, large; a, b = large", number=10)
3.593393853981979
Run Code Online (Sandbox Code Playgroud)
请注意,在最后一对时间中,我每次只做了10次,但如果有人有更好的机器或更多的耐心,请随意增加 number
hpa*_*ulj 10
unique与return_index用途argsort.随着maximum.accumulate这是没有必要的.所以我们可以蚕食unique并做:
In [313]: a = [2,1,2,3,4,5,4,6,5,7,8,9,8,10,11]
In [314]: arr = np.array(a)
In [315]: aux = np.maximum.accumulate(arr)
In [316]: flag = np.concatenate(([True], aux[1:] != aux[:-1])) # key unique step
In [317]: idx = np.nonzero(flag)
In [318]: idx
Out[318]: (array([ 0, 3, 4, 5, 7, 9, 10, 11, 13, 14], dtype=int32),)
In [319]: arr[idx]
Out[319]: array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [320]: np.array(b)[idx]
Out[320]: array([ 1, 4, 5, 6, 8, 10, 11, 12, 14, 15])
In [323]: np.unique(aux, return_index=True)[1]
Out[323]: array([ 0, 3, 4, 5, 7, 9, 10, 11, 13, 14], dtype=int32)
Run Code Online (Sandbox Code Playgroud)
def foo(arr):
aux=np.maximum.accumulate(arr)
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
return np.nonzero(flag)[0]
In [330]: timeit foo(arr)
....
100000 loops, best of 3: 12.5 µs per loop
In [331]: timeit np.unique(np.maximum.accumulate(arr), return_index=True)[1]
....
10000 loops, best of 3: 21.5 µs per loop
Run Code Online (Sandbox Code Playgroud)
凭借(10000,)形状,medium这种无排他性的独特性具有显着的速度优势:
In [334]: timeit np.unique(np.maximum.accumulate(medium[0]), return_index=True)[1]
1000 loops, best of 3: 351 µs per loop
In [335]: timeit foo(medium[0])
The slowest run took 4.14 times longer ....
10000 loops, best of 3: 48.9 µs per loop
Run Code Online (Sandbox Code Playgroud)
[1]:np.source(np.unique)用于查看代码,或者?? 在IPython中
| 归档时间: |
|
| 查看次数: |
2189 次 |
| 最近记录: |