Python numpy code more efficient than eigen3 or plain C++

Question

Ser*_*mal 5 c++ python performance numpy eigen3

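I have some code in Python3 (using numpy) that I would like to convert to C++ (using eigen3) to get a more efficient program. So I decided to test a simple example to assess the performance gain I would get. The code consists of two random arrays that are multiplied coefficient-wise. My conclusion was that the Python code with numpy is about 30% faster than the C++ code. I'd like to know why interpreted Python code is faster than compiled C++ code. Am I missing something in the C++ code?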

I'm using gcc 9.1.0, Eigen 3.3.7, Python 3.7.3 and Numpy 1.16.4.

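Possible explanations:

The C++ program isn't using vectorization
Numpy is a lot more optimized than I thought
Time is measuring different things in each program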
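
One way to look into the third explanation is to time the same operation several times inside a single process with a monotonic clock, so that any one-off cost (for example, memory being touched for the first time) may show up as a slower first iteration. A minimal sketch, where `time_repeated` is a hypothetical helper name that does not appear in the original post:

```python
import time


def time_repeated(func, repeats=10):
    """Call func() `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

With the arrays from main.py, something like `time_repeated(lambda: a * a1)` would give one duration per repetition, and comparing the first value with the rest hints at whether setup costs are included in the measurement.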

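There is a similar question on Stack Overflow (Eigen Matrix vs Numpy Array multiplication performance). I tested it on my computer and got the expected result, namely that eigen is more efficient than numpy, but the operation there is matrix multiplication rather than coefficient-wise multiplication.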

Python code (main.py)
Run command: python3 main.py

import numpy as np
import time

Lx = 4096
Ly = 4000

# Filling arrays
a = np.random.rand(Lx, Ly).astype(np.float64)
a1 = np.random.rand(Lx, Ly).astype(np.float64)

# Coefficient-wise product
start = time.time()
b = a*a1

# Compute the elapsed time
end = time.time()

print(b.sum())
print("duration: ", end-start)

C++ code with eigen3 (main_eigen.cpp)
Compile command: g++ -O3 -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen

#include <iostream>
#include <chrono>
#include "Eigen/Dense"

#define Lx 4096
#define Ly 4000
typedef double T;

int main(){

    // Allocating arrays
    Eigen::Array<T, -1, -1> KPM_ghosts(Lx, Ly), KPM_ghosts1(Lx, Ly), b(Lx,Ly);

    // Filling the arrays
    KPM_ghosts.setRandom();
    KPM_ghosts1.setRandom();

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    b = KPM_ghosts*KPM_ghosts1;

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    std::cout << b.sum() << "\n";

    return 0;
}

Plain C++ code (main.cpp)
Compile command: g++ -O3 main.cpp -o prog

#include <iostream>
#include <chrono>
#include <cstdlib>  // std::rand, RAND_MAX

#define Lx 4096
#define Ly 4000
#define N Lx*Ly
typedef double T;

int main(){
    // Allocating arrays
    T lin_vector1[N];
    T lin_vector2[N];
    T lin_vector3[N];

    // Filling the arrays
    for(unsigned i = 0; i < N; i++){
        lin_vector1[i] = std::rand()*1.0/RAND_MAX;
        lin_vector2[i] = std::rand()*1.0/RAND_MAX;
    }

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    for(unsigned i = 0; i < N; i++)
        lin_vector3[i] = lin_vector1[i]*lin_vector2[i];

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    double sum = 0;
    for(unsigned i = 0; i < N; i++)
        sum += lin_vector3[i];
    std::cout << "sum: " << sum << "\n";


    return 0;
}

Each program was run 10 times

Plain C++
elapsed time: 0.210664s
elapsed time: 0.215406s
elapsed time: 0.222483s
elapsed time: 0.21526s
elapsed time: 0.216346s
elapsed time: 0.218951s
elapsed time: 0.21587s
elapsed time: 0.213639s
elapsed time: 0.219399s
elapsed time: 0.213403s

Plain C++ with eigen3
elapsed time: 0.21052s
elapsed time: 0.220779s
elapsed time: 0.216269s
elapsed time: 0.229234s
elapsed time: 0.212265s
elapsed time: 0.256714s
elapsed time: 0.212396s
elapsed time: 0.248241s
elapsed time: 0.241537s
elapsed time: 0.323519s

Python
duration:  0.23946428298950195
duration:  0.1663036346435547
duration:  0.17225909233093262
duration:  0.15922021865844727
duration:  0.16628384590148926
duration:  0.15654635429382324
duration:  0.15859222412109375
duration:  0.1633443832397461
duration:  0.1685199737548828
duration:  0.16393446922302246
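
To put a number on the gap, the three sets of timings above can be compared by their medians, which are less sensitive than the mean to single slow runs. This is only a sketch for summarizing the listed figures; `slowdown` is a hypothetical helper name, and the lists are copied from the runs above.

```python
from statistics import median

# Timings in seconds, copied from the ten runs of each program.
plain_cpp = [0.210664, 0.215406, 0.222483, 0.21526, 0.216346,
             0.218951, 0.21587, 0.213639, 0.219399, 0.213403]
eigen = [0.21052, 0.220779, 0.216269, 0.229234, 0.212265,
         0.256714, 0.212396, 0.248241, 0.241537, 0.323519]
numpy_py = [0.23946428298950195, 0.1663036346435547, 0.17225909233093262,
            0.15922021865844727, 0.16628384590148926, 0.15654635429382324,
            0.15859222412109375, 0.1633443832397461, 0.1685199737548828,
            0.16393446922302246]


def slowdown(other, reference):
    """Ratio of the median run time of `other` to that of `reference`."""
    return median(other) / median(reference)


for label, runs in (("plain C++", plain_cpp), ("eigen3", eigen)):
    print(f"{label}: {slowdown(runs, numpy_py):.2f}x the numpy median")
```

By this measure the plain C++ median is roughly 30% above the numpy median, which is consistent with the figure quoted in the question.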

Answer 1

小智 0

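I'd like to add a few hypotheses to the comments above.

One is that numpy is doing multithreading. Your C++ is compiled with -O3, which usually already gives a good speedup. I assume the numpy in the default PyPI package is not compiled with -O3 or other optimizations, yet it is considerably faster. One way that could happen is if it is slower to begin with but uses multiple CPU cores.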

One way to check is to make it use only a single thread by setting the variables mentioned here:

OMP_NUM_THREADS=1 MPI_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1
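
The same check can be scripted so that the benchmark runs in a child interpreter with those variables set before numpy is imported. This is a rough sketch: `single_thread_env` and `run_single_threaded` are hypothetical helper names, the variable names are taken from the line above, and which of them actually has an effect depends on the libraries the installed numpy build links against.

```python
import os
import subprocess
import sys

# Variable names copied from the answer; not every build honors all of them.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MPI_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
)


def single_thread_env(base=None):
    """Return a copy of the environment with every thread variable set to "1"."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = "1"
    return env


def run_single_threaded(script="main.py"):
    """Run the script in a child interpreter and return what it printed."""
    result = subprocess.run(
        [sys.executable, script],
        env=single_thread_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `print(run_single_threaded("main.py"))` and comparing the reported duration with an unrestricted run would show whether the thread count makes any difference for this operation.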

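Alternatively, or in addition to the above, it may be due to an optimized build, such as the MKL build that can be installed from Anaconda. As the comments above suggest, you can also see how much using SSE or AVX in the C++ code improves its performance, with a flag such as -march=native.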

Archived:	6 years, 7 months ago
Views:	669
Last activity:	6 years, 7 months ago