Art*_*nov 6 python statistics numpy vectorization
我有size = n的样本.
我想计算每个i:1 <= i <= n中位数为sample[:i]numpy.例如,我计算每个i的均值:
cummean = np.cumsum(sample) / np.arange(1, n + 1)
我可以为没有周期和理解的中位数做类似的事情吗?
知道 Python 有一个heapq模块可以让你保持一个可迭代的运行“最小值”,我搜索了heapq和median,并找到了steaming medium. 这个:
http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/
有一个class streamMedian保持两个heapq,一个是值的下半部分,另一个是上半部分。中位数要么是一个的“顶部”,要么是两者的平均值。这个类有一个insert方法和一个getMedian方法。大部分工作都在insert.
我将其复制到 Ipython 会话中,并定义了:
def cummedian_stream(b):
S=streamMedian()
ret = []
for item in b:
S.insert(item)
ret.append(S.getMedian())
return np.array(ret)
Run Code Online (Sandbox Code Playgroud)
测试:
In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop
Run Code Online (Sandbox Code Playgroud)
该heapq流的方法是方式更快。
给出的列表理解@Uriel相对较慢。但是,如果我取代np.median了statistics.median它的速度比@Divakar's排序的解决方案:
def fastloop(a):
return np.array([np.median(a[:i+1]) for i in range(len(a))])
In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop
Run Code Online (Sandbox Code Playgroud)
和@Paul Panzer's分区的方法也不错,但仍然缓慢相比,流类。
In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop
Run Code Online (Sandbox Code Playgroud)
(streamMedian如果需要,我可以将课程复制到这个答案)。
这是一种沿行复制元素以给我们一个2D数组的方法。然后,我们将用一个大数字填充上三角区域,以便稍后当我们沿每一行对数组进行排序时,基本上会对所有元素进行排序,直到对角线元素并模拟累积窗口。median然后,按照选择中间一个或两个中间的平均值(对于偶数个元素)的定义,我们将得到第一个位置的元素 : (0,0),然后对于第二行: 的平均值(1,0) & (1,1),对于第三行row : (2,1),第四行 : 的平均值,(3,1) & (3,2)依此类推。因此,我们将从排序数组中提取出这些元素,从而得到中值。
因此,实施将是 -
\n\ndef cummedian_sorted(a):\n n = a.size\n maxn = a.max()+1\n a_tiled_sorted = np.tile(a,n).reshape(-1,n)\n mask = np.triu(np.ones((n,n),dtype=bool),1)\n\n a_tiled_sorted[mask] = maxn\n a_tiled_sorted.sort(1)\n\n all_rows = a_tiled_sorted[np.arange(n), np.arange(n)//2].astype(float)\n idx = np.arange(1,n,2)\n even_rows = a_tiled_sorted[idx, np.arange(1,1+(n//2))]\n all_rows[idx] += even_rows\n all_rows[1::2] /= 2.0\n return all_rows\nRun Code Online (Sandbox Code Playgroud)\n\n运行时测试
\n\n方法 -
\n\n# Loopy solution from @Uriel's soln \ndef cummedian_loopy(arr):\n return [median(a[:i]) for i in range(1,len(a)+1)]\n\n# Nan-fill based solution from @Nickil Maveli's soln \ndef cummedian_nanfill(arr):\n a = np.tril(arr).astype(float)\n a[np.triu_indices(a.shape[0], k=1)] = np.nan\n return np.nanmedian(a, axis=1)\nRun Code Online (Sandbox Code Playgroud)\n\n时间安排 -
\n\n设置#1:
\n\nIn [43]: a = np.random.randint(0,100,(100))\n\nIn [44]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n ...: \nTrue\nTrue\n\nIn [45]: %timeit cummedian_loopy(a)\n ...: %timeit cummedian_nanfill(a)\n ...: %timeit cummedian_sorted(a)\n ...: \n1000 loops, best of 3: 856 \xc2\xb5s per loop\n1000 loops, best of 3: 778 \xc2\xb5s per loop\n10000 loops, best of 3: 200 \xc2\xb5s per loop\nRun Code Online (Sandbox Code Playgroud)\n\n设置#2:
\n\nIn [46]: a = np.random.randint(0,100,(1000))\n\nIn [47]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n ...: \nTrue\nTrue\n\nIn [48]: %timeit cummedian_loopy(a)\n ...: %timeit cummedian_nanfill(a)\n ...: %timeit cummedian_sorted(a)\n ...: \n10 loops, best of 3: 118 ms per loop\n10 loops, best of 3: 47.6 ms per loop\n100 loops, best of 3: 18.8 ms per loop\nRun Code Online (Sandbox Code Playgroud)\n\n设置#3:
\n\nIn [49]: a = np.random.randint(0,100,(5000))\n\nIn [50]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n\nTrue\nTrue\n\nIn [54]: %timeit cummedian_loopy(a)\n ...: %timeit cummedian_nanfill(a)\n ...: %timeit cummedian_sorted(a)\n ...: \n1 loops, best of 3: 3.36 s per loop\n1 loops, best of 3: 583 ms per loop\n1 loops, best of 3: 521 ms per loop\nRun Code Online (Sandbox Code Playgroud)\n