寻求 Numpy 中计算量大的数学函数的优化

Question

寻求 Numpy 中计算量大的数学函数的优化

sam*_*lfe 2 python optimization performance numpy vectorization

最近在开发一个Python脚本，实现一个特定的数学函数，如下图所示

其中指数化是周期性的并且1 <= j <= n. 功能比较复杂，受到之前一个问题的启发。该代码的主要目的是评估大型数组x（大小为 5000 或更大）的数学函数。以下是该函数在 Python 中的当前实现：

import numpy as np
import sys, os

def compute_y(x, L, v):
    n = len(x)
    y = np.zeros(n)

    for k in range(L+1):
        # Generate the indices for the window around each j, with periodic wrapping
        indices = np.arange(-k, k+1)

        # Compute the weights
        weights_1 = k - np.abs(indices)
        weights_2 = k + 1 - np.abs(indices)
        weights_d = np.ones(2*k+1)

        # For each j, take the elements in the window around j and multiply by the weights
        x_matrix = np.take(x, (np.arange(n)[:, None] + indices) % n, mode='wrap')
        
        exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)
        exp_2 = np.exp(-np.sum(weights_2[None, :] * x_matrix, axis=1)/v)
        denom = np.sum(weights_d[None, :] * x_matrix, axis=1)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/denom

    return y

# Test the function
x = np.random.rand(5000)
L = len(x)//2
v = 1.4
y = compute_y(x, L, v)

Run Code Online (Sandbox Code Playgroud)

我当前面临的问题是，该代码虽然是函数式且矢量化的，但速度明显慢于预期，特别是在应用于大型数组时。我相信这种缓慢的主要原因是 for 循环，它处理生成索引、计算权重以及求和和计算指数。

因此，我正在寻找有关如何加速此代码的指导和建议，特别是对于大小为 5000 或更大的数组。我对通过 Numpy 利用矢量化来加速计算的方法特别感兴趣。

任何帮助将非常感激。先感谢您！

Answer 1

Nic*_*ell 5

卷积

我认为最重要的是，这或多或少是一个卷积：

exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)

Run Code Online (Sandbox Code Playgroud)

如果查看 x_matrix 的值，它们是每个 x 值周围位置 -1, 0, 1, ... 的 x 的 x 值。然后乘以权重。这就是一个卷积。

我们为什么关心？这很有帮助，因为有人已经制作了用于进行卷积的快速库。

这里的核心思想是np.take()用三个卷积替换，并避免创建x_matrix数组。

exp_1 = np.exp(-np.sum(weights_1[None, :] * x_matrix, axis=1)/v)

Run Code Online (Sandbox Code Playgroud)

（为什么在进行卷积之前会创建 x 数组的 3 个副本？在每个数组的边缘，您希望它环绕并访问数组另一端的元素，但是scipy.signal.convolve只会将卷积的这些部分视为零。）

这工作得很好，在 5000 个元素的数组上实现了 141 倍的加速。（718 秒 vs. 5.06 秒）

重用卷积

卷积非常昂贵，在前面的示例中，我们最终在每个循环中执行了三个卷积。我们可以做得更好吗？

让我们打印出每个循环中卷积使用的权重：

k=0
weights_1=array([0])
weights_2=array([1])
weights_d=array([1.])
k=1
weights_1=array([0, 1, 0])
weights_2=array([1, 2, 1])
weights_d=array([1., 1., 1.])

Run Code Online (Sandbox Code Playgroud)

我们可以注意到三件事：

分母中的权重均为一，这是均匀的滤波器响应。scipy有专门的函数，对于统一过滤器来说速度更快。
的值weights_2等于的值weights_1加上均匀滤波器。
的值weights_1等于weights_2上一个循环中的值。

利用这些观察结果，我们可以将 3 个卷积变为 1 个卷积。

def compute_y_reuse(x, L, v):
    n = len(x)
    y = np.zeros(n)

    last_exp_2_raw = np.zeros(n)
    for k in range(L+1):
        uniform = scipy.ndimage.uniform_filter(x, size=2*k + 1, mode='wrap') * (2*k + 1)
        exp_1_raw = last_exp_2_raw
        exp_1 = np.exp(-exp_1_raw/v)
        exp_2_raw = exp_1_raw + uniform
        exp_2 = np.exp(-exp_2_raw/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/uniform
        
        last_exp_2_raw = exp_2_raw

    return y

Run Code Online (Sandbox Code Playgroud)

与 5000 个元素阵列上的原始速度相比，这实现了 1550 倍的加速。（718 秒与 0.462 秒）

删除最后一个卷积

我进一步研究了这一点，并尝试删除最后一个卷积。本质上，这个想法是，在上一个循环中，我们计算了 N 个最接近元素的总和，而下一个循环，我们计算了 N+2 个最接近元素的总和，因此我们可以将最接近的 2 个元素相加。边缘。

我尝试使用np.roll()它，但发现它比慢uniform_filter()，因为它必须复制数组。最终我找到了这个线程，它让我弄清楚如何解决这个问题。

此外，由于与exp_1_raw前一次迭代的相同，我们可以通过保存该迭代的输出来exp_2_raw重新使用该调用。np.exp()

import scipy.signal

def compute_y_convolution(x, L, v):
    n = len(x)
    y = np.zeros(n)
    x3 = np.tile(x, 3)

    for k in range(L+1):
        # Generate the indices for the window around each j, with periodic wrapping
        indices = np.arange(-k, k+1)

        # Compute the weights
        weights_1 = k - np.abs(indices)
        weights_2 = k + 1 - np.abs(indices)
        weights_d = np.ones(2*k + 1)

        conv = scipy.signal.convolve(x3, weights_d, mode='same')[n:-n]
        exp_1 = np.exp(-(scipy.signal.convolve(x3, weights_1, mode='same')[n:-n])/v)
        exp_2 = np.exp(-(scipy.signal.convolve(x3, weights_2, mode='same')[n:-n])/v)

        # Compute the weighted sum for each j and add to the total
        y += (exp_1 - exp_2)/conv

    return y

Run Code Online (Sandbox Code Playgroud)

与 5000 个元素阵列上的原始速度相比，这实现了 3100 倍的加速。（718 秒 vs. 0.225 秒）它也不再需要 scipy 作为依赖项。

@samwolfe 今天我又看了一遍，通过重新使用 exp_1 的上一个循环中的 `np.exp()` 调用，将最终结果提高了 30%。 (2认同)

归档时间：	2 年，1 月前
查看次数：	165 次
最近记录：	2 年，1 月前