Most efficient way to forward-fill NaN values in a numpy array

Xuk*_*rao 35 python arrays performance numpy pandas

Example Problem

As a simple example, consider the numpy array arr defined as follows:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

In console output, arr looks like this:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

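I would now like to row-wise 'forward-fill' the nan values in the array arr. By that I mean replacing each nan value with the nearest valid value to its left. The desired result would look like this: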

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])

Tried so far

I've tried using for-loops:

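for row_idx in range(arr.shape[0]):
    # Start at column 1: a NaN in column 0 has no left neighbour, and
    # col_idx - 1 would otherwise wrap around to the last column.
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]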

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

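import pandas as pd
df = pd.DataFrame(arr)
# DataFrame.ffill replaces fillna(method='ffill'), which is deprecated in recent pandas
df = df.ffill(axis=1)
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
arr = df.to_numpy()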

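Both of the above strategies produce the desired result, but I keep wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?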


Summary

Is there another, more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)


Update: Solutions comparison

I've tried to time all solutions so far. This was my setup script:

import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # ffill() and to_numpy() replace fillna(method='ffill') and the removed as_matrix()
    out = df.ffill(axis=1).to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
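Note that each timed statement above also includes the call to random_array(). Here is a minimal sketch, assuming the standard-library timeit module rather than IPython's %timeit, that builds one array up front and checks that the two fills agree before timing them. loops_fill and numpy_fill are copied from the setup script; time_fill is a hypothetical helper, and the numbers it prints will differ from machine to machine.

```python
import timeit

import numpy as np


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def time_fill(func, arr, number=10, repeat=3):
    """Hypothetical helper: best average time per call, in seconds."""
    times = timeit.repeat(lambda: func(arr), number=number, repeat=repeat)
    return min(times) / number


# Build the test array once so that its creation is not part of the timing.
rng = np.random.default_rng(0)
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
sample = rng.choice(choices, size=(1000, 10))

# Both implementations should agree, with NaNs compared as equal.
assert np.array_equal(loops_fill(sample), numpy_fill(sample), equal_nan=True)

for func in (loops_fill, numpy_fill):
    print(f"{func.__name__}: {time_fill(func, sample) * 1e6:.0f} µs per call")
```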

Div*_*kar 37

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just want to fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
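To see why the approach works, it can help to print the intermediate index array. The sketch below runs the same four steps on the 3x5 array from the question. The idea is that idx first holds each valid element's own column index (0 for NaNs), and the running maximum then turns it into "column of the most recent valid value". A NaN in column 0 stays NaN, because its index 0 points back at itself.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Column index where the value is valid, 0 where it is NaN.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
print(idx)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# Running maximum along each row: every position now holds the column
# of the last valid value at or before it.
np.maximum.accumulate(idx, axis=1, out=idx)
print(idx)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Gather: row i, column idx[i, j] for every (i, j).
out = arr[np.arange(idx.shape[0])[:, None], idx]
print(out)
```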

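  • A vectorized numpy-only solution, nice. Thanks! This solution does indeed seem to be faster than the loop-based and pandas-based solutions (see the timings in the updated question). (3 upvotes)
  • How can this solution be adapted to the case where arr is a *one-dimensional* numpy array, such as `numpy.array([0.83, 0.83, 0.83, 0.83, nan, nan, nan])`? (2 upvotes)
  • @user189035 Replace `mask.shape[1]` with `mask.size`, remove `axis=1`, and replace the last line with `out = arr[idx]` (2 upvotes)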
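Putting the three changes from the last comment together gives roughly the following sketch for the 1-D case. The function name ffill_1d is hypothetical and does not appear in the answer.

```python
import numpy as np


def ffill_1d(arr):
    """Forward-fill NaNs in a 1-D array (hypothetical name)."""
    mask = np.isnan(arr)
    # mask.size instead of mask.shape[1]
    idx = np.where(~mask, np.arange(mask.size), 0)
    # no axis argument needed for a 1-D array
    np.maximum.accumulate(idx, out=idx)
    # plain integer-array indexing instead of the 2-D row/column gather
    return arr[idx]


a = np.array([0.83, 0.83, 0.83, 0.83, np.nan, np.nan, np.nan])
print(ffill_1d(a))
```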

Ric*_*ieV 8

I liked Divakar's pure numpy answer. Here's a generalized function for n-dimensional arrays:

def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

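Although a MultiIndex can make up for it, AFAIK pandas can only handle two dimensions. The only way to do this in pandas would be to flatten the data into a DataFrame, unstack the desired level, restack, and finally reshape back to the original shape. This unstacking/restacking/reshaping (which involves pandas sorting) is just unnecessary overhead to get the same result.

Test: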

def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit

Output:

arr
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3. nan  4.  4.  3.]
  [ 3.  2. nan  4. nan nan  3.  4.]
  [ 2.  2.  2. nan  1.  1. nan  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1. nan]
  [ 4.  2. nan  4.  4.  3. nan  4.]
  [ 2.  4.  2.  1.  4.  1.  3. nan]]]

ffull
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3.  4.  4.  4.  3.]
  [ 3.  2.  1.  4.  4.  4.  3.  4.]
  [ 2.  2.  2.  4.  1.  1.  3.  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1.  3.]
  [ 4.  2.  1.  4.  4.  3.  1.  4.]
  [ 2.  4.  2.  1.  4.  1.  3.  4.]]]
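One way to sanity-check np_ffill is to compare it with a simpler reduction to the 2-D case: move the fill axis last, collapse the remaining axes into rows, apply Divakar's 2-D fill, and undo the reshaping. The sketch below assumes that approach. ffill_2d and ffill_via_reshape are hypothetical names and are not part of the answer above, and np_ffill is copied unchanged so the block runs on its own.

```python
import numpy as np


def np_ffill(arr, axis):
    # Copied from the answer above.
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim == i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]


def ffill_2d(arr):
    # Divakar's row-wise fill for 2-D input.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def ffill_via_reshape(arr, axis):
    """Hypothetical cross-check: reduce the n-d problem to the 2-D case."""
    moved = np.moveaxis(arr, axis, -1)          # put the fill axis last
    flat = moved.reshape(-1, moved.shape[-1])   # all other axes become rows
    filled = ffill_2d(flat).reshape(moved.shape)
    return np.moveaxis(filled, -1, axis)        # restore the original axis order


rng = np.random.default_rng(0)
ra = rng.choice(np.array([1.0, 2.0, 3.0, 4.0, np.nan]), size=(2, 4, 8))

for axis in range(ra.ndim):
    same = np.array_equal(np_ffill(ra, axis), ffill_via_reshape(ra, axis),
                          equal_nan=True)
    print(f"axis={axis}: results match = {same}")
```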


cch*_*ala 7

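Update: As Financial_Physician pointed out in the comments, my originally proposed solution can simply be replaced by an ffill on the reversed array, followed by reversing the result. There is no relevant performance loss. According to %timeit, my original solution seems to be 2% or 3% faster. I've updated the code example below but left my original text in place.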


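For those who came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to do the accumulation on the reversed array, using the minimum instead of the maximum.

Here is the code: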


# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr): 
    return ffill(arr[:, ::-1])[:, ::-1]

# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out


# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)

print('\nffill')
print(ffill(arr))

print('\nbfill')
print(bfill(arr))


Output:

Array:
[[ 5. nan nan  7.  2.]
 [ 3. nan  1.  8. nan]
 [ 4.  9.  6. nan nan]]

ffill
[[5. 5. 5. 7. 2.]
 [3. 3. 1. 8. 8.]
 [4. 9. 6. 6. 6.]]

bfill
[[ 5.  7.  7.  7.  2.]
 [ 3.  1.  1.  8. nan]
 [ 4.  9.  6. nan nan]]
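As the output shows, trailing NaNs stay NaN under a backward-fill, because there is no later valid value to copy from. A quick sketch to confirm that the reversed-ffill version and the older minimum-accumulate version agree, including on those trailing NaNs. The random test array is an assumption for illustration; the three functions are copied from the code above.

```python
import numpy as np


def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def bfill(arr):
    # Reverse each row, forward-fill, then reverse back.
    return ffill(arr[:, ::-1])[:, ::-1]


def bfill_old(arr):
    mask = np.isnan(arr)
    # NaN positions get the last column index so they never win the minimum.
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    return arr[np.arange(idx.shape[0])[:, None], idx]


rng = np.random.default_rng(0)
sample = rng.choice(np.array([1.0, 2.0, 3.0, np.nan]), size=(200, 12))

# NaNs are compared as equal so that unfilled trailing NaNs do not fail the check.
print(np.array_equal(bfill(sample), bfill_old(sample), equal_nan=True))
```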

Edit: Updated according to the comment by MS_


shx*_*hx2 5

Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...


Jos*_*lez 5

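The push function from bottleneck is a good option for forward-filling. It is commonly used internally in packages like Xarray, it should be faster than the other alternatives, and the package also has a set of benchmarks.

Example: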

import numpy as np

from bottleneck import push

a = np.array(
    [
        [1, np.nan, 3],
        [np.nan, 3, 2],
        [2, np.nan, np.nan]
    ]
)
push(a, axis=0)
array([[ 1., nan,  3.],
       [ 1.,  3.,  2.],
       [ 2.,  3.,  2.]])