Numpy approach for splitting an array when the cumulative sum exceeds x

Cod*_*oob 6 python arrays numpy

Data

Let's take the following 2D array:

starts = [0, 4, 10, 13, 23, 27]
ends = [4, 10, 13, 23, 27, 32]
lengths = [4, 6, 3, 10, 4, 5] 

arr = np.array([starts, ends, lengths]).T

So it looks like:

[[ 0  4  4]
 [ 4 10  6]
 [10 13  3]
 [13 23 10]
 [23 27  4]
 [27 32  5]]

Goal

Now I want to "loop" through `lengths`, and once the cumulative sum reaches 10, output the corresponding `starts` and `ends` and then restart the cumulative count.


Working code

tot_size = 0
start = 0

for i, numb in enumerate(arr[:,-1]):
    # update sum
    tot_size += numb

    # Check if target size is reached
    if tot_size >= 10:
        start_loc, end_loc = arr[:,0][start],  arr[:,1][i]
        print('Start: {}\nEnd: {}\nSum: {}\n'.format(start_loc, end_loc, tot_size))
        start = i + 1
        tot_size = 0

# Last part
start_loc, end_loc = arr[:,0][start],  arr[:,1][i]
print('Start: {}\nEnd: {}\nSum: {}\n'.format(start_loc, end_loc, tot_size))

This will print:

Start: 0
End: 10
Sum: 10

Start: 10
End: 23
Sum: 13

Start: 23
End: 32
Sum: 9

I don't need to know the resulting sums, but I do need the `starts` and `ends`.


Numpy attempts

I figure there must be a more direct, or rather vectorized, way to do this with numpy.

  • cumsum + remainder

I was thinking of something like `np.remainder(np.cumsum(arr[:,-1]), 10)`, but it is hard to tell when something merely gets close to the target number (here 10); that is not the same as splitting only when the running sum exceeds x.

  • stride_tricks

Since the approach above doesn't work on these windows, I thought of strides, but stride windows have a fixed size.
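To illustrate why the plain cumsum + remainder idea falls short on this data (a quick check using the `lengths` defined above): only the group that happens to sum to an exact multiple of 10 leaves a `0` marker, while groups that overshoot (sum 13) carry their remainder into the next group instead of resetting.

```python
import numpy as np

lengths = np.array([4, 6, 3, 10, 4, 5])

# Remainders of the running sum: only index 1 (cumsum == 10) is flagged
# with a 0; the splits after sums of 13 and 9 leave no usable marker.
print(np.remainder(np.cumsum(lengths), 10))  # [4 0 3 3 7 2]
```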

All ideas are welcome :)

Jér*_*ard 1

Numpy is not designed to solve such a problem efficiently. You can still solve it using some tricks or the usual combination of cumsum + division + diff + where or similar ones (like @Kevin proposed), but AFAIK they are all inefficient: they require many temporary arrays and expensive operations.

Temporary arrays are expensive for two reasons: for small arrays, the overhead of a Numpy function call is typically several microseconds, which generally adds up to dozens of microseconds for the whole operation; for big arrays, each operation is memory-bound, and memory bandwidth is limited on modern platforms. It is actually even worse, since writing into a newly allocated array is much slower due to page faults, and Numpy array writes are currently not optimized on most platforms (including the mainstream x86-64).

As for "expensive operations", these include sorting, which runs in O(n log n) (quick-sort is used by default) and is generally memory-bound; finding the unique values (which currently performs a sort internally); and integer division, which has always been notoriously slow.


One way to solve this problem is to use Numba (or Cython). Numba uses a just-in-time compiler to produce fast, optimized functions; it is especially useful for writing your own efficient Numpy-style building blocks. Here is an example based on your code:

import numba as nb
import numpy as np

@nb.njit(['(int32[:,:],)', '(int64[:,:],)'])
def compute(arr):
    n = len(arr)
    tot_size, start, cur = 0, 0, 0
    # Worst case: every row closes a slice, so n rows suffice
    slices = np.empty((n, 2), arr.dtype)

    for i in range(n):
        tot_size += arr[i, 2]

        if tot_size >= 10:
            # Close the current slice and reset the running sum
            slices[cur, 0] = arr[start, 0]
            slices[cur, 1] = arr[i, 1]
            start = i + 1
            cur += 1
            tot_size = 0

    # Trailing partial slice
    slices[cur, 0] = arr[start, 0]
    slices[cur, 1] = arr[i, 1]
    return slices[:cur+1]
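For reference, here is a plain-Python/Numpy re-implementation of the same greedy loop (no Numba required; the names `compute_py` and `target` are just illustrative), which is handy for sanity-checking the Numba output. Unlike the version above, it only emits the trailing slice when a partial group actually remains:

```python
import numpy as np

def compute_py(arr, target=10):
    # Greedy split: close a (start, end) slice each time the running
    # sum of lengths reaches the target, then reset the sum.
    slices = []
    tot_size, start = 0, 0
    for i in range(len(arr)):
        tot_size += arr[i, 2]
        if tot_size >= target:
            slices.append((arr[start, 0], arr[i, 1]))
            start = i + 1
            tot_size = 0
    if tot_size > 0:  # trailing partial group, if any
        slices.append((arr[start, 0], arr[-1, 1]))
    return np.array(slices)

starts = [0, 4, 10, 13, 23, 27]
ends = [4, 10, 13, 23, 27, 32]
lengths = [4, 6, 3, 10, 4, 5]
arr = np.array([starts, ends, lengths]).T

print(compute_py(arr))  # [[ 0 10] [10 23] [23 32]]
```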

For your small example, the Numba function takes about 0.75 us on my machine while the initial solution takes 3.5 us. In comparison, the Numpy solutions provided by @Kevin (returning the indices) take 24 us for the np.unique-based one and 6 us for the division-based one. In fact, a basic np.cumsum alone already takes 0.65 us on my machine. Thus, the Numba solution is the fastest, and this should be especially true for larger arrays.