大型 3D 阵列上的快速 1D 线性 np.NaN 插值

Ker*_*ten 5 python interpolation numpy scipy pandas

我有一个 3D 数组(z, y, x)shape=(92, 4800, 4800)其中每个值都axis 0代表不同的时间点。在少数情况下,时域中的值获取失败,导致某些值变为np.NaN。在其他情况下,没有获取任何值,所有值z都是np.NaN

什么是使用线性插值来填充最有效的方式np.NaN沿着axis 0无视所有的值是实例np.NaN

这是我正在做的工作的一个工作示例,它使用pandas包装器到scipy.interpolate.interp1d. 在原始数据集上,每个切片大约需要 2 秒,这意味着整个数组的处理时间为 2.6 小时。减小大小的示例数据集大约需要 9.5 秒。

import numpy as np
import pandas as pd

# create example data, original is (92, 4800, 4800)
test_arr = np.random.randint(low=-10000, high=10000, size=(92, 480, 480))
test_arr[1:90:7, :, :] = -32768  # NaN fill value in original data
test_arr[:, 1:90:6, 1:90:8] = -32768

def interpolate_nan(arr, method="linear", limit=3):
    """return array interpolated along time-axis to fill missing values"""
    result = np.zeros_like(arr, dtype=np.int16)

    for i in range(arr.shape[1]):
        # slice along y axis, interpolate with pandas wrapper to interp1d
        line_stack = pd.DataFrame(data=arr[:,i,:], dtype=np.float32)
        line_stack.replace(to_replace=-37268, value=np.NaN, inplace=True)
        line_stack.interpolate(method=method, axis=0, inplace=True, limit=limit)
        line_stack.replace(to_replace=np.NaN, value=-37268, inplace=True)
        result[:, i, :] = line_stack.values.astype(np.int16)
    return result
Run Code Online (Sandbox Code Playgroud)

使用示例数据集在我的机器上的性能:

%timeit interpolate_nan(test_arr)
1 loops, best of 3: 9.51 s per loop
Run Code Online (Sandbox Code Playgroud)

编辑:

我应该澄清代码正在产生我的预期结果。问题是 - 我怎样才能优化这个过程?

Ker*_*ten 5

我最近在numba的帮助下为我的特定用例解决了这个问题,并且还写了一些关于它的文章

from numba import jit

@jit(nopython=True)
def interpolate_numba(arr, no_data=-32768):
    """return array interpolated along time-axis to fill missing values"""
    result = np.zeros_like(arr, dtype=np.int16)

    for x in range(arr.shape[2]):
        # slice along x axis
        for y in range(arr.shape[1]):
            # slice along y axis
            for z in range(arr.shape[0]):
                value = arr[z,y,x]
                if z == 0:  # don't interpolate first value
                    new_value = value
                elif z == len(arr[:,0,0])-1:  # don't interpolate last value
                    new_value = value

                elif value == no_data:  # interpolate

                    left = arr[z-1,y,x]
                    right = arr[z+1,y,x]
                    # look for valid neighbours
                    if left != no_data and right != no_data:  # left and right are valid
                        new_value = (left + right) / 2

                    elif left == no_data and z == 1:  # boundary condition left
                        new_value = value
                    elif right == no_data and z == len(arr[:,0,0])-2:  # boundary condition right
                        new_value = value

                    elif left == no_data and right != no_data:  # take second neighbour to the left
                        more_left = arr[z-2,y,x]
                        if more_left == no_data:
                            new_value = value
                        else:
                            new_value = (more_left + right) / 2

                    elif left != no_data and right == no_data:  # take second neighbour to the right
                        more_right = arr[z+2,y,x]
                        if more_right == no_data:
                            new_value = value
                        else:
                            new_value = (more_right + left) / 2

                    elif left == no_data and right == no_data:  # take second neighbour on both sides
                        more_left = arr[z-2,y,x]
                        more_right = arr[z+2,y,x]
                        if more_left != no_data and more_right != no_data:
                            new_value = (more_left + more_right) / 2
                        else:
                            new_value = value
                    else:
                        new_value = value
                else:
                    new_value = value
                result[z,y,x] = int(new_value)
    return result
Run Code Online (Sandbox Code Playgroud)

这比我的初始代码大约20 倍