Gyo*_*yom 48 python matlab numpy matplotlib
在一个pylab程序(也可能是一个matlab程序)中,我有一个数字表示距离的numpy数组:d[t]
是时间距离t
(我的数据的时间跨度是len(d)
时间单位).
我感兴趣的事件是当距离低于某个阈值时,我想计算这些事件的持续时间.很容易得到一组布尔值b = d<threshold
,问题归结为计算真实单词长度的顺序b
.但我不知道如何有效地做到这一点(即使用numpy原语),并且我使用数组并进行手动更改检测(即当值从False变为True时初始化计数器,只要值为True就增加计数器,当值返回False时,将计数器输出到序列.但这非常缓慢.
如何在numpy数组中有效地检测那种序列?
下面是一些python代码,说明我的问题:第四个点需要很长时间才能出现(如果没有,增加数组的大小)
from pylab import *
threshold = 7
print '.'
d = 10*rand(10000000)
print '.'
b = d<threshold
print '.'
durations=[]
for i in xrange(len(b)):
if b[i] and (i==0 or not b[i-1]):
counter=1
if i>0 and b[i-1] and b[i]:
counter+=1
if (b[i-1] and not b[i]) or i==len(b)-1:
durations.append(counter)
print '.'
Run Code Online (Sandbox Code Playgroud)
Ale*_*lli 47
虽然不是numpy
原始的,但itertools
功能通常非常快,所以请尝试这个(并且测量包括这个在内的各种解决方案的时间):
def runs_of_ones(bits):
for bit, group in itertools.groupby(bits):
if bit: yield sum(group)
Run Code Online (Sandbox Code Playgroud)
如果确实需要列表中的值,那么当然可以使用list(runs_of_ones(bits)); 但也许列表理解可能会略微加快:
def runs_of_ones_list(bits):
return [sum(g) for b, g in itertools.groupby(bits) if b]
Run Code Online (Sandbox Code Playgroud)
转向"numpy-native"的可能性,那么:
def runs_of_ones_array(bits):
# make sure all runs of ones are well-bounded
bounded = numpy.hstack(([0], bits, [0]))
# get 1 at run starts and -1 at run ends
difs = numpy.diff(bounded)
run_starts, = numpy.where(difs > 0)
run_ends, = numpy.where(difs < 0)
return run_ends - run_starts
Run Code Online (Sandbox Code Playgroud)
再次:确保在实际的示例中针对彼此对比解决方案!
Tho*_*wne 42
适用于任何阵列的完全numpy矢量化和通用RLE(也适用于字符串,布尔值等).
输出运行长度,起始位置和值的元组.
import numpy as np
def rle(inarray):
""" run length encoding. Partial credit to R rle function.
Multi datatype arrays catered for including non Numpy
returns: tuple (runlengths, startpositions, values) """
ia = np.asarray(inarray) # force numpy
n = len(ia)
if n == 0:
return (None, None, None)
else:
y = np.array(ia[1:] != ia[:-1]) # pairwise unequal (string safe)
i = np.append(np.where(y), n - 1) # must include last element posi
z = np.diff(np.append(-1, i)) # run lengths
p = np.cumsum(np.append(0, z))[:-1] # positions
return(z, p, ia[i])
Run Code Online (Sandbox Code Playgroud)
非常快(i7):
xx = np.random.randint(0, 5, 1000000)
%timeit yy = rle(xx)
100 loops, best of 3: 18.6 ms per loop
Run Code Online (Sandbox Code Playgroud)
多种数据类型:
rle([True, True, True, False, True, False, False])
Out[8]:
(array([3, 1, 1, 2]),
array([0, 3, 4, 5]),
array([ True, False, True, False], dtype=bool))
rle(np.array([5, 4, 4, 4, 4, 0, 0]))
Out[9]: (array([1, 4, 2]), array([0, 1, 5]), array([5, 4, 0]))
rle(["hello", "hello", "my", "friend", "okay", "okay", "bye"])
Out[10]:
(array([2, 1, 1, 2, 1]),
array([0, 2, 3, 4, 6]),
array(['hello', 'my', 'friend', 'okay', 'bye'],
dtype='|S6'))
Run Code Online (Sandbox Code Playgroud)
与Alex Martelli相同的结果如上:
xx = np.random.randint(0, 2, 20)
xx
Out[60]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])
am = runs_of_ones_array(xx)
tb = rle(xx)
am
Out[63]: array([4, 5, 2, 5])
tb[0][tb[2] == 1]
Out[64]: array([4, 5, 2, 5])
%timeit runs_of_ones_array(xx)
10000 loops, best of 3: 28.5 µs per loop
%timeit rle(xx)
10000 loops, best of 3: 38.2 µs per loop
Run Code Online (Sandbox Code Playgroud)
比Alex慢一点(但速度非常快),而且更加灵活.
这是一个仅使用数组的解决方案:它采用包含bool序列的数组并计算转换的长度.
>>> from numpy import array, arange
>>> b = array([0,0,0,1,1,1,0,0,0,1,1,1,1,0,0], dtype=bool)
>>> sw = (b[:-1] ^ b[1:]); print sw
[False False True False False True False False True False False False
True False]
>>> isw = arange(len(sw))[sw]; print isw
[ 2 5 8 12]
>>> lens = isw[1::2] - isw[::2]; print lens
[3 4]
Run Code Online (Sandbox Code Playgroud)
sw
包含一个true,其中有一个开关,isw
在索引中转换它们.然后成对地减去isw的项目lens
.
请注意,如果序列以1开始,它将计算0s序列的长度:这可以在索引中修复以计算镜头.另外,我没有测试过角情况这样的长度为1的序列.
完整函数,返回所有子True
列的起始位置和长度.
import numpy as np
def count_adjacent_true(arr):
assert len(arr.shape) == 1
assert arr.dtype == np.bool
if arr.size == 0:
return np.empty(0, dtype=int), np.empty(0, dtype=int)
sw = np.insert(arr[1:] ^ arr[:-1], [0, arr.shape[0]-1], values=True)
swi = np.arange(sw.shape[0])[sw]
offset = 0 if arr[0] else 1
lengths = swi[offset+1::2] - swi[offset:-1:2]
return swi[offset:-1:2], lengths
Run Code Online (Sandbox Code Playgroud)
测试了不同的bool 1D阵列(空数组;单个/多个元素;偶数/奇数长度;以True
/ 开头False
;仅使用True
/ False
元素)
万一有人好奇(因为你传递了MATLAB),这里有一种方法可以在MATLAB中解决它:
threshold = 7;
d = 10*rand(1,100000); % Sample data
b = diff([false (d < threshold) false]);
durations = find(b == -1)-find(b == 1);
Run Code Online (Sandbox Code Playgroud)
我对Python并不太熟悉,但也许这可以帮助你提供一些想法.=)
也许有点晚了,但总体来说基于 Numba 的方法将是最快的。
\nimport numpy as np\nimport numba as nb\n\n\n@nb.njit\ndef count_steps_nb(arr):\n result = 1\n last_x = arr[0]\n for x in arr[1:]:\n if last_x != x:\n result += 1\n last_x = x\n return result\n\n\n@nb.njit\ndef rle_nb(arr):\n n = count_steps_nb(arr)\n values = np.empty(n, dtype=arr.dtype)\n lengths = np.empty(n, dtype=np.int_)\n last_x = arr[0]\n length = 1\n i = 0\n for x in arr[1:]:\n if last_x != x:\n values[i] = last_x\n lengths[i] = length\n i += 1\n length = 1\n last_x = x\n else:\n length += 1\n values[i] = last_x\n lengths[i] = length\n return values, lengths\n
Run Code Online (Sandbox Code Playgroud)\n基于 Numpy 的方法(受到@ThomasBrowne 答案的启发,但速度更快,因为昂贵的使用numpy.concatenate()
减少到最低限度)是亚军(这里提出了两种类似的方法,一种使用不等式来查找步骤的位置,另一个使用差异):
import numpy as np\n\n\ndef rle_np_neq(arr):\n n = len(arr)\n if n == 0:\n values = np.empty(0, dtype=arr.dtype)\n lengths = np.empty(0, dtype=np.int_)\n else:\n positions = np.concatenate(\n [[-1], np.nonzero(arr[1:] != arr[:-1])[0], [n - 1]])\n lengths = positions[1:] - positions[:-1]\n values = arr[positions[1:]]\n return values, lengths\n
Run Code Online (Sandbox Code Playgroud)\nimport numpy as np\n\n\ndef rle_np_diff(arr):\n n = len(arr)\n if n == 0:\n values = np.empty(0, dtype=arr.dtype)\n lengths = np.empty(0, dtype=np.int_)\n else:\n positions = np.concatenate(\n [[-1], np.nonzero(arr[1:] - arr[:-1])[0], [n - 1]])\n lengths = positions[1:] - positions[:-1]\n values = arr[positions[1:]]\n return values, lengths\n
Run Code Online (Sandbox Code Playgroud)\n这些都优于 na\xc3\xafve 和简单的单循环方法:
\nimport numpy as np\n\n\ndef rle_loop(arr):\n values = []\n lengths = []\n last_x = arr[0]\n length = 1\n for x in arr[1:]:\n if last_x != x:\n values.append(last_x)\n lengths.append(length)\n length = 1\n last_x = x\n else:\n length += 1\n values.append(last_x)\n lengths.append(length)\n return np.array(values), np.array(lengths)\n
Run Code Online (Sandbox Code Playgroud)\n相反,使用itertools.groupby()
不会比简单循环更快(除非在@AlexMartelli回答中这样的非常特殊的情况下或者有人将在组对象上实现__len__
),因为通常没有简单的方法来提取组大小信息除了循环组本身之外,这并不快:
import itertools\nimport numpy as np\n\n\ndef rle_groupby(arr):\n values = []\n lengths = []\n for x, group in itertools.groupby(arr):\n values.append(x)\n lengths.append(sum(1 for _ in group))\n return np.array(values), np.array(lengths)\n
Run Code Online (Sandbox Code Playgroud)\n报告了对不同大小的随机整数数组的一些基准测试的结果:
\n\n(完整分析见此处)。
\n 归档时间: |
|
查看次数: |
17159 次 |
最近记录: |