Poor performance of C++ function in Cython

Question

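I have this C++ function, which I can call from Python with the code below. Compared to running pure C++, the performance is only half. Is there a way to bring the two to the same level? I compile both codes with the -Ofast -march=native flags. I do not understand where I could be losing 50%, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I could avoid?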

namespace diff
{
    void diff_cpp(double* __restrict__ at, const double* __restrict__ a, const double visc,
                  const double dxidxi, const double dyidyi, const double dzidzi,
                  const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                            + ( (a[ijk+jj] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }
}

I have this .pyx file:

# import both numpy and the Cython declarations for numpy
import cython
import numpy as np
cimport numpy as np

# declare the interface to the C code
cdef extern from "diff_cpp.cpp" namespace "diff":
    void diff_cpp(double* at, double* a, double visc, double dxidxi, double dyidyi, double dzidzi, int itot, int jtot, int ktot)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
         np.ndarray[double, ndim=3, mode="c"] a not None,
         double visc, double dxidxi, double dyidyi, double dzidzi):
    cdef int ktot, jtot, itot
    ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
    diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
    return None

I call this function from Python:

import numpy as np
import diff
import time

nloop = 20
itot = 256
jtot = 256
ktot = 256
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot))

index = np.arange(ncells)
a = (index/(index+1))**2
a.shape = (ktot, jtot, itot)

# Check results
diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

# Time the loop
start = time.perf_counter()
for i in range(nloop):
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
end = time.perf_counter()

print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))

This is the setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("diff",
                             sources=["diff.pyx"],
                             language="c++",
                             extra_compile_args=["-Ofast", "-march=native"],
                             include_dirs=[numpy.get_include()])],
)

And here is the C++ reference file, which reaches twice the performance:

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <stdlib.h>
#include <cstdio>
#include <ctime>
#include "math.h"

void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
{
    for (int i=0; i<ncells; ++i)
    {
        a[i]  = pow(i,2)/pow(i+1,2);
        at[i] = 0.;
    }
}

void diff(double* const __restrict__ at, const double* const __restrict__ a, const double visc, 
          const double dxidxi, const double dyidyi, const double dzidzi, 
          const int itot, const int jtot, const int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                        + ( (a[ijk+jj] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                        );
            }
}

int main()
{
    const int nloop = 20;
    const int itot = 256;
    const int jtot = 256;
    const int ktot = 256;
    const int ncells = itot*jtot*ktot;

    double *a  = new double[ncells];
    double *at = new double[ncells];

    init(a, at, ncells);

    // Check results
    diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 
    printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

    // Time performance 
    std::clock_t start = std::clock(); 

    for (int i=0; i<nloop; ++i)
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 

    double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

    printf("time/iter = %f s (%i iters)\n",duration/(double)nloop, nloop);

    return 0;
}

Answer 1

ead*_*ead 5

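The problem here is not what happens at run time, but which optimizations happen at compile time.

Which optimizations are performed depends on the compiler (and even on its version), and there is no guarantee that every optimization that could be done will be done.

In fact, there are two different reasons why the Cython version is slower, depending on whether you use g++ or clang++:

g++ cannot optimize because of the -fwrapv flag in the Cython build.
clang++ cannot optimize in the first place (read on to see what happens).

First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations cannot be done.

If you look at the log of the setup, you will see: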

 x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native

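As you noted, -Ofast wins over -O2 because it comes last. The problem is -fwrapv, which seems to prevent some optimizations: with it, signed integer overflow is defined to wrap around, so the compiler can no longer treat overflow as undefined behavior (UB) and exploit that for optimization.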
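
To check whether -fwrapv is among the defaults on your own machine without digging through the build log, you can inspect the interpreter's build configuration. This is only a sketch: it assumes a POSIX build where the `CFLAGS` and `OPT` configuration variables are set (on Windows they are typically `None`), and the helper names `default_compile_flags` and `has_wrapv` are made up for illustration.

```python
import sysconfig


def default_compile_flags():
    """Collect the flags recorded in the interpreter's build configuration."""
    flags = []
    for name in ("CFLAGS", "OPT"):
        value = sysconfig.get_config_var(name)  # may be None, e.g. on Windows
        if value:
            flags.extend(value.split())
    return flags


def has_wrapv(flags):
    """Return True if the last wrap-related flag in the list is -fwrapv."""
    wrap_flags = [f for f in flags if f in ("-fwrapv", "-fno-wrapv")]
    return bool(wrap_flags) and wrap_flags[-1] == "-fwrapv"


flags = default_compile_flags()
print("default flags:", " ".join(flags))
print("-fwrapv active by default:", has_wrapv(flags))
```

The build log remains the authoritative source, since the build tool may add or reorder flags beyond what is recorded here.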

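So you have the following options:

Add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
Build a library from the cpp file with only the flags you want and link it to your Cython module. This solution has some overhead, but it has the advantage of being robust: as you pointed out, different Cython flags could be the problem for different compilers, so the first solution might be too brittle.
I am not sure whether the default flags can be disabled, but maybe there is some information in the docs.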
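
The first option could look roughly like the sketch below. It assumes, as in the log above, that the extra arguments are placed after the default flags so that a trailing -fno-wrapv overrides the default -fwrapv. The helper names `compile_flags` and `make_extension` are hypothetical; the module name, source file and language are taken from the question's setup.py.

```python
import shlex


def compile_flags(user_flags="-Ofast -march=native"):
    """Split the flag string into separate arguments and append -fno-wrapv."""
    flags = shlex.split(user_flags)
    # Assumption: extra args come after the defaults, so this overrides -fwrapv.
    flags.append("-fno-wrapv")
    return flags


def make_extension():
    """Build the Extension for the question's diff.pyx.

    The imports are kept local so that compile_flags() can be used and
    inspected without setuptools or numpy being installed.
    """
    import numpy
    from setuptools import Extension

    return Extension(
        "diff",
        sources=["diff.pyx"],
        language="c++",
        extra_compile_args=compile_flags(),
        include_dirs=[numpy.get_include()],
    )


print(compile_flags())
```

In a setup.py, the result of `make_extension()` would be passed to `setup(...)` through `ext_modules`, for example wrapped in `cythonize([...])` from `Cython.Build`.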

Second issue (clang++): inlining in the test cpp program.

When I compile your cpp program with my pretty old g++ version 5.4:

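g++ test.cpp -o test -Ofast -march=native -fwrapv

it becomes almost 3 times slower than the build without -fwrapv. This is, however, a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the -fwrapv flag should not have any impact.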

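My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any degradation of performance. I need to disable inlining via -fno-inline to get slower code, but it is then slower even without -fwrapv, i.e.: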

clang++ test.cpp -o test -Ofast -march=native -fno-inline

So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, which is something Cython cannot do.

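So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but it was able to optimize it for size=256. Cython, however, can only use the non-optimized version of diff. That is the reason why -fno-wrapv has no positive impact.

My takeaway: disallow inlining of the function of interest in the cpp tester (e.g. compile it into its own object file) to ensure a level playing field with Cython. Otherwise one sees the performance of a program that was specially optimized for this one input.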
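
As a rough sketch of that takeaway, the helper below prints build commands that compile the kernel into its own object file, so the benchmark driver cannot inline it. It assumes the driver only declares the function (rather than including the .cpp file) and that link-time optimization is not enabled; the file names `main.cpp` and `test` and the helper name `build_commands` are hypothetical, while `diff_cpp.cpp` and the flags come from the question.

```python
import shlex

FLAGS = ["-Ofast", "-march=native"]


def object_name(source):
    """Map 'foo.cpp' to 'foo.o'."""
    return source.rsplit(".", 1)[0] + ".o"


def build_commands(compiler="g++", kernel="diff_cpp.cpp",
                   driver="main.cpp", exe="test"):
    """Return compile and link commands with the kernel in its own object file."""
    kernel_obj = object_name(kernel)
    driver_obj = object_name(driver)
    return [
        # The kernel is compiled alone: the sizes used by the driver are unknown here.
        [compiler, *FLAGS, "-c", kernel, "-o", kernel_obj],
        # The driver only sees a declaration of the kernel, so it cannot inline it.
        [compiler, *FLAGS, "-c", driver, "-o", driver_obj],
        # Plain link step; no -flto, which could re-enable cross-file inlining.
        [compiler, kernel_obj, driver_obj, "-o", exe],
    ]


for command in build_commands():
    print(shlex.join(command))
```

The same commands with `compiler="clang++"` give the clang++ variant of the comparison.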

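Note: a funny thing is that if all ints are replaced by unsigned ints, then -fwrapv naturally plays no role, but the version with unsigned int is as slow as the int version with -fwrapv. This is only logical, as there is no undefined behavior to be exploited.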
