What is causing the 2x slowdown in my Cython implementation of matrix-vector multiplication?

Question

Ber*_* U. 8 python numpy matrix linear-algebra cython

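I am currently trying to implement basic matrix-vector multiplication in Cython (as part of a larger project, to reduce computation) and I am finding that my code is about 2x slower than numpy.dot.

I am wondering whether I am missing something that causes the slowdown. I am writing optimized Cython code: declaring variable types, requiring contiguous arrays, and avoiding cache misses. I even tried using Cython only as a wrapper and calling native C code (see below).

My question: what else can I do to speed up my implementation so that it runs as fast as NumPy for this basic operation?

The Cython code I am using is below: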

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64;
ctypedef np.float64_t DTYPE_T

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def matrix_vector_multiplication(np.ndarray[DTYPE_T, ndim=2] A, np.ndarray[DTYPE_T, ndim=1] x):

    cdef Py_ssize_t i, j
    cdef Py_ssize_t N = A.shape[0]
    cdef Py_ssize_t D = A.shape[1]
    cdef np.ndarray[DTYPE_T, ndim=1] y = np.empty(N, dtype = DTYPE)
    cdef DTYPE_T val

    for i in range(N):
        val = 0.0
        for j in range(D):
            val += A[i,j] * x[j]
        y[i] = val
    return y

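Before timing anything, it can help to confirm that the loop computes the same values as numpy.dot. The following is a minimal pure-Python sketch of the same row-by-row accumulation. `reference_matvec` is a hypothetical helper name that is not part of the original code, and the small array sizes are chosen only to keep the check fast.

```python
import numpy as np

def reference_matvec(A, x):
    """Row-by-row dot products, mirroring the Cython double loop."""
    n_rows, n_cols = A.shape
    y = np.empty(n_rows, dtype=np.float64)
    for i in range(n_rows):
        val = 0.0
        for j in range(n_cols):
            val += A[i, j] * x[j]
        y[i] = val
    return y

rng = np.random.RandomState(0)
A = rng.random_sample((50, 7))
x = rng.random_sample(7)

# Compare against NumPy's own matrix-vector product
matches = np.allclose(reference_matvec(A, x), A.dot(x))
print(matches)
```

The same comparison can be run against the compiled `matrix_vector_multiplication` function once the extension is built.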

I am compiling this file (seMatrixVectorExample.pyx) with the following script:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

ext_modules=[ Extension("seMatrixVectorExample",
                        ["seMatrixVectorExample.pyx"],
                        libraries=["m"],
                        extra_compile_args = ["-ffast-math"])]

setup(
    name = "seMatrixVectorExample",
    cmdclass = {"build_ext": build_ext},
    include_dirs = [np.get_include()],
    ext_modules = ext_modules
)


and I am using the following test script to evaluate performance:

import numpy as np
from seMatrixVectorExample import matrix_vector_multiplication
import time

n_rows, n_cols = int(1e6), 100   # array sizes must be integers
np.random.seed(seed = 0)

# initialize the data matrix A and the vector x as C-contiguous arrays
A = np.random.random(size=(n_rows, n_cols))
A = np.require(A, requirements = ['C'])

x = np.random.random(size=n_cols)
x = np.require(x, requirements = ['C'])

start_time = time.time()
scores = matrix_vector_multiplication(A, x)
print("cython runtime = %1.5f seconds" % (time.time() - start_time))

start_time = time.time()
py_scores = np.exp(A.dot(x))  # note: this timing also includes np.exp
print("numpy runtime = %1.5f seconds" % (time.time() - start_time))


For a test matrix with n_rows = 1e6 and n_cols = 100 (the sizes set in the script above), I get:

cython runtime = 0.08852 seconds
numpy runtime = 0.04372 seconds

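A single time.time() measurement can be noisy, and the NumPy timing in the script wraps the product in np.exp while the Cython timing does not. The sketch below is one way to time both variants over several repeats. `best_of` is a hypothetical helper name, and the matrix is deliberately much smaller than the 1e6-row one so the example runs quickly.

```python
import time
import numpy as np

def best_of(func, repeats=5):
    """Return the smallest wall-clock time over `repeats` calls to func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

rng = np.random.RandomState(0)
A = rng.random_sample((20000, 100))  # smaller than the original benchmark, to stay quick
x = rng.random_sample(100)

t_dot = best_of(lambda: A.dot(x))               # matrix-vector product only
t_dot_exp = best_of(lambda: np.exp(A.dot(x)))   # product plus np.exp, as in the script
print("dot only     = %1.5f seconds" % t_dot)
print("dot + np.exp = %1.5f seconds" % t_dot_exp)
```

Taking the minimum over repeats is a common convention for micro-benchmarks; the standard-library timeit module offers the same idea with less code.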

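Edit: It is worth mentioning that the slowdown persists even when I implement the matrix-vector multiplication in native C code and use Cython only as a wrapper.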

/* Computes y = A * x for a row-major (C-contiguous) N x D matrix A. */
void c_multiply(double* y, double* A, double* x, int N, int D) {

    int i, j;
    int index = 0;  /* running offset into the flat, row-major buffer A */
    double val;

    for (i = 0; i < N; i++) {
        val = 0.0;
        for (j = 0; j < D; j++) {
            val = val + A[index] * x[j];
            index++;
        }
        y[i] = val;
    }
    return;
}

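The C function walks the matrix through a single running index, which only lines up with A[i, j] when the buffer is stored row by row. The following Python sketch mirrors that indexing to make the assumption explicit. `flat_matvec` is a hypothetical name used only for illustration.

```python
import numpy as np

def flat_matvec(A, x):
    """Matrix-vector product using one running index into A's flat buffer."""
    n_rows, n_cols = A.shape
    flat = np.ascontiguousarray(A).ravel()  # row-major, like the double* passed to C
    y = np.empty(n_rows, dtype=np.float64)
    index = 0
    for i in range(n_rows):
        val = 0.0
        for j in range(n_cols):
            # flat[index] is A[i, j] because index == i * n_cols + j here
            val += flat[index] * x[j]
            index += 1
        y[i] = val
    return y

A = np.arange(12, dtype=np.float64).reshape(3, 4)
x = np.array([1.0, 0.0, -1.0, 2.0])
y = flat_matvec(A, x)
print(y)
```

If the array handed to the C code were column-major instead, the same loop would read the elements in the wrong order, which is why the wrapper insists on C-contiguous input.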

Here is the Cython wrapper, which simply passes pointers to the first elements of y, A and x:

import cython
import numpy as np
cimport numpy as np

DTYPE = np.float64;
ctypedef np.float64_t DTYPE_T

# declare the interface to the C code
cdef extern void c_multiply (double* y, double* A, double* x, int N, int D)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def multiply(np.ndarray[DTYPE_T, ndim=2, mode="c"] A, np.ndarray[DTYPE_T, ndim=1, mode="c"] x):

    cdef int N = A.shape[0]
    cdef int D = A.shape[1]
    cdef np.ndarray[DTYPE_T, ndim=1, mode = "c"] y = np.empty(N, dtype = DTYPE)

    c_multiply (&y[0], &A[0,0], &x[0], N, D)

    return y

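The mode="c" buffer declarations in the wrapper mean the function expects C-contiguous input. A short sketch of how to inspect and, if needed, enforce that layout on the NumPy side before calling the wrapper (the array shapes here are arbitrary and chosen only for illustration):

```python
import numpy as np

A_c = np.zeros((4, 3), dtype=np.float64)  # NumPy default: C (row-major) order
A_f = np.asfortranarray(A_c)              # same values, column-major layout

print(A_c.flags['C_CONTIGUOUS'], A_f.flags['C_CONTIGUOUS'])

# np.ascontiguousarray leaves an already C-contiguous array alone
# and makes a row-major copy otherwise.
A_c2 = np.ascontiguousarray(A_c)
A_f2 = np.ascontiguousarray(A_f)
print(np.shares_memory(A_c2, A_c), np.shares_memory(A_f2, A_f))
```

Note that the conversion copies the data when the layout differs, so for a fair benchmark it belongs outside the timed region.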

Answer 1

Ber*_* U. 3

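OK, I finally managed to get better runtimes than NumPy!

Here is what (I believe) caused the difference: NumPy is calling BLAS functions, which are coded in Fortran rather than C, and that results in the speed difference.

I think this is worth noting because I was previously under the impression that the BLAS functions were written in C, and I could not see why they would run noticeably faster than the second, native C implementation that I posted in the question.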
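Which BLAS library NumPy dispatches to depends on how that particular NumPy build was configured, so it can be worth checking on the machine where the benchmark runs. A minimal sketch using NumPy's own configuration report:

```python
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries
# this NumPy build was linked against (the names vary between installations).
np.show_config()
```

The output format differs between NumPy versions, so it is best read by eye rather than parsed.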

In any case, I can now replicate the performance by using Cython together with the SciPy Cython BLAS function pointers from scipy.linalg.cython_blas.

For completeness, here is the new Cython code, blas_multiply.pyx:

import cython
import numpy as np
cimport numpy as np
cimport scipy.linalg.cython_blas as blas

DTYPE = np.float64
ctypedef np.float64_t DTYPE_T

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def blas_multiply(np.ndarray[DTYPE_T, ndim=2, mode="fortran"] A, np.ndarray[DTYPE_T, ndim=1, mode="fortran"] x):
    #calls dgemv from BLAS which computes y = alpha * trans(A) * x + beta * y
    #see: http://www.nag.com/numeric/fl/nagdoc_fl22/xhtml/F06/f06paf.xml

    cdef int N = A.shape[0]
    cdef int D = A.shape[1]
    cdef int lda = N
    cdef int incx = 1 #increments of x
    cdef int incy = 1 #increments of y
    cdef double alpha = 1.0
    cdef double beta = 0.0
    cdef np.ndarray[DTYPE_T, ndim=1, mode = "fortran"] y = np.empty(N, dtype = DTYPE)

    blas.dgemv("N", &N, &D, &alpha, &A[0,0], &lda, &x[0], &incx, &beta, &y[0], &incy)

    return y

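The dgemv routine called above computes y = alpha * op(A) * x + beta * y. The same routine is also reachable from plain Python through scipy.linalg.blas.dgemv, which can be a convenient way to check the argument conventions before compiling anything. A small sketch, assuming SciPy is installed (the array sizes are arbitrary):

```python
import numpy as np
from scipy.linalg.blas import dgemv

rng = np.random.RandomState(0)
A = np.asfortranarray(rng.random_sample((6, 4)))  # column-major, the layout BLAS expects
x = rng.random_sample(4)

# trans=0 corresponds to "N" in the Cython call: y = alpha * A * x (beta defaults to 0)
y = dgemv(1.0, A, x, trans=0)
print(np.allclose(y, A.dot(x)))
```

This Python-level wrapper adds call overhead of its own, so it is meant as a correctness check rather than a replacement for the Cython version.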

Here is the code I used to build it:

#!/usr/bin/env python

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy
import scipy

ext_modules=[ Extension("blas_multiply",
                        sources=["blas_multiply.pyx"],
                        include_dirs=[numpy.get_include(), scipy.get_include()],
                        libraries=["m"],
                        extra_compile_args = ["-ffast-math"])]

setup(
    cmdclass = {'build_ext': build_ext},
    include_dirs = [numpy.get_include(), scipy.get_include()],
    ext_modules = ext_modules,
)


Here is the test code (note that the arrays passed to the BLAS function are now F_CONTIGUOUS):

import numpy as np
from blas_multiply import blas_multiply
import time

#np.__config__.show()
n_rows, n_cols = int(1e6), 100   # array sizes must be integers
np.random.seed(seed = 0)

# initialize data matrix X and label vector Y
X = np.random.random(size=(n_rows, n_cols))
Y = np.random.randint(low=0, high=2, size=(n_rows, 1))
Y[Y==0] = -1
Z = X*Y
Z.flags
Z = np.require(Z, requirements = ['F'])

rho_test = np.random.randint(low=-10, high=10, size= n_cols)
set_to_zero = np.random.choice(range(0, n_cols), size =(n_cols // 2, 1), replace=False)
rho_test[set_to_zero] = 0.0
rho_test = np.require(rho_test, dtype=Z.dtype, requirements = ['F'])

start_time = time.time()
scores = blas_multiply(Z, rho_test)
print("Cython runtime = %1.5f seconds" % (time.time() - start_time))


Z = np.require(Z, requirements = ['C'])
rho_test = np.require(rho_test, requirements = ['C'])
start_time = time.time()
py_scores = np.exp(Z.dot(rho_test))
print("Python runtime = %1.5f seconds" % (time.time() - start_time))

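The Cython code sets lda to the number of rows, which matches a column-major array whose consecutive columns are exactly one column length apart in memory. The sketch below shows how np.require with the 'F' requirement produces such an array and how the leading dimension can be read off its strides (the tiny 4 x 3 matrix is only for illustration):

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(4, 3)  # C order by default
Z = np.require(X, requirements=['F'])              # column-major copy of X

n_rows, n_cols = Z.shape
print(Z.flags['F_CONTIGUOUS'])
print(Z.strides)  # (itemsize, n_rows * itemsize) for a column-major array

# Leading dimension in elements: the distance between the starts of two columns
lda = Z.strides[1] // Z.itemsize
print(lda == n_rows)
```

For a 1-D vector such as rho_test the 'C' and 'F' requirements describe the same memory layout, so only the matrix conversion involves a real reordering.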

The results of the test on my machine are:

Cython runtime = 0.04556 seconds
Python runtime = 0.05110 seconds

