NumPy:计算累积中位数

Art*_*nov 6 python statistics numpy vectorization

我有size = n的样本.

我想计算每个i:1 <= i <= n中位数为sample[:i]numpy.例如,我计算每个i的均值:

cummean = np.cumsum(sample) / np.arange(1, n + 1)

我可以为没有周期和理解的中位数做类似的事情吗?

hpa*_*ulj 6

知道 Python 有一个heapq模块可以让你保持一个可迭代的运行“最小值”,我搜索了heapqmedian,并找到了steaming medium. 这个:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

有一个class streamMedian保持两个heapq,一个是值的下半部分,另一个是上半部分。中位数要么是一个的“顶部”,要么是两者的平均值。这个类有一个insert方法和一个getMedian方法。大部分工作都在insert.

我将其复制到 Ipython 会话中,并定义了:

def cummedian_stream(b):
    S=streamMedian()
    ret = []
    for item in b:
        S.insert(item)
        ret.append(S.getMedian())
    return np.array(ret)
Run Code Online (Sandbox Code Playgroud)

测试:

In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop
Run Code Online (Sandbox Code Playgroud)

heapq流的方法是方式更快。


给出的列表理解@Uriel相对较慢。但是,如果我取代np.medianstatistics.median它的速度比@Divakar's排序的解决方案:

def fastloop(a):
    return np.array([np.median(a[:i+1]) for i in range(len(a))])

In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop
Run Code Online (Sandbox Code Playgroud)

@Paul Panzer's分区的方法也不错,但仍然缓慢相比,流类。

In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop
Run Code Online (Sandbox Code Playgroud)

streamMedian如果需要,我可以将课程复制到这个答案)。


Div*_*kar 5

这是一种沿行复制元素以给我们一个2D数组的方法。然后,我们将用一个大数字填充上三角区域,以便稍后当我们沿每一行对数组进行排序时,基本上会对所有元素进行排序,直到对角线元素并模拟累积窗口。median然后,按照选择中间一个或两个中间的平均值(对于偶数个元素)的定义,我们将得到第一个位置的元素 : (0,0),然后对于第二行: 的平均值(1,0) & (1,1),对于第三行row : (2,1),第四行 : 的平均值,(3,1) & (3,2)依此类推。因此,我们将从排序数组中提取出这些元素,从而得到中值。

\n\n

因此,实施将是 -

\n\n
def cummedian_sorted(a):\n    n = a.size\n    maxn = a.max()+1\n    a_tiled_sorted = np.tile(a,n).reshape(-1,n)\n    mask = np.triu(np.ones((n,n),dtype=bool),1)\n\n    a_tiled_sorted[mask] = maxn\n    a_tiled_sorted.sort(1)\n\n    all_rows = a_tiled_sorted[np.arange(n), np.arange(n)//2].astype(float)\n    idx = np.arange(1,n,2)\n    even_rows = a_tiled_sorted[idx, np.arange(1,1+(n//2))]\n    all_rows[idx] += even_rows\n    all_rows[1::2] /= 2.0\n    return all_rows\n
Run Code Online (Sandbox Code Playgroud)\n\n

运行时测试

\n\n

方法 -

\n\n
# Loopy solution from @Uriel's soln   \ndef cummedian_loopy(arr):\n    return [median(a[:i]) for i in range(1,len(a)+1)]\n\n# Nan-fill based solution from @Nickil Maveli's soln   \ndef cummedian_nanfill(arr):\n    a = np.tril(arr).astype(float)\n    a[np.triu_indices(a.shape[0], k=1)] = np.nan\n    return np.nanmedian(a, axis=1)\n
Run Code Online (Sandbox Code Playgroud)\n\n

时间安排 -

\n\n

设置#1:

\n\n
In [43]: a = np.random.randint(0,100,(100))\n\nIn [44]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n    ...: \nTrue\nTrue\n\nIn [45]: %timeit cummedian_loopy(a)\n    ...: %timeit cummedian_nanfill(a)\n    ...: %timeit cummedian_sorted(a)\n    ...: \n1000 loops, best of 3: 856 \xc2\xb5s per loop\n1000 loops, best of 3: 778 \xc2\xb5s per loop\n10000 loops, best of 3: 200 \xc2\xb5s per loop\n
Run Code Online (Sandbox Code Playgroud)\n\n

设置#2:

\n\n
In [46]: a = np.random.randint(0,100,(1000))\n\nIn [47]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n    ...: \nTrue\nTrue\n\nIn [48]: %timeit cummedian_loopy(a)\n    ...: %timeit cummedian_nanfill(a)\n    ...: %timeit cummedian_sorted(a)\n    ...: \n10 loops, best of 3: 118 ms per loop\n10 loops, best of 3: 47.6 ms per loop\n100 loops, best of 3: 18.8 ms per loop\n
Run Code Online (Sandbox Code Playgroud)\n\n

设置#3:

\n\n
In [49]: a = np.random.randint(0,100,(5000))\n\nIn [50]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))\n    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))\n\nTrue\nTrue\n\nIn [54]: %timeit cummedian_loopy(a)\n    ...: %timeit cummedian_nanfill(a)\n    ...: %timeit cummedian_sorted(a)\n    ...: \n1 loops, best of 3: 3.36 s per loop\n1 loops, best of 3: 583 ms per loop\n1 loops, best of 3: 521 ms per loop\n
Run Code Online (Sandbox Code Playgroud)\n