C++ OpenMP 无法加速矩阵乘法

Question

C++ OpenMP 无法加速矩阵乘法

Ξέν*_*νος 3 c++ openmp matrix-multiplication c++20

我用 C++ 编写了一个简单的矩阵乘法程序，并且它有效。我只是 C++ 的初学者，但我已经成功地让它工作了。

\n

让我困惑的是它比 NumPy 慢得多。我已经对它进行了基准测试。

\n

因此，我尝试使用 OpenMP 来加速，但我观察到性能完全没有变化：

\n

#include <algorithm>\n#include <chrono>\n#include <iostream>\n#include <omp.h>\n#include <string>\n#include <vector>\n\n\nusing std::vector;\nusing std::chrono::high_resolution_clock;\nusing std::chrono::duration;\nusing std::chrono::duration_cast;\nusing std::chrono::microseconds;\nusing std::cout;\nusing line = vector<double>;\nusing matrix = vector<line>;\n\n\nvoid fill(line &l) {\n    std::generate(l.begin(), l.end(), []() { return ((double)rand() / (RAND_MAX)); });\n}\n\nmatrix random_matrx(int64_t height, int64_t width) {\n    matrix mat(height, line(width));\n    std::for_each(mat.begin(), mat.end(), fill);\n    return mat;\n}\n\nmatrix dot_product(const matrix &mat0, const matrix &mat1) {\n    size_t h0, w0, h1, w1;\n    h0 = mat0.size();\n    w0 = mat0[0].size();\n    h1 = mat1.size();\n    w1 = mat1[0].size();\n    if (w0 != h1) {\n        throw std::invalid_argument("matrices cannot be cross multiplied");\n    }\n\n    matrix out(h0, line(w1));\n    for (int y = 0; y < h0; y++) {\n        for (int x = 0; x < w1; x++) {\n            double s = 0;\n            for (int z = 0; z < w0; z++) {\n                s += mat0[y][z] * mat1[z][x];\n            }\n            out[y][x] = s;\n        }\n    }\n\n    return out;\n}\n\nmatrix dot_product_omp(const matrix& mat0, const matrix& mat1) {\n    size_t h0, w0, h1, w1;\n    h0 = mat0.size();\n    w0 = mat0[0].size();\n    h1 = mat1.size();\n    w1 = mat1[0].size();\n    if (w0 != h1) {\n        throw std::invalid_argument("matrices cannot be cross multiplied");\n    }\n\n    matrix out(h0, line(w1));\n    omp_set_num_threads(4);\n    #pragma omp parallel for private(y, x, z) schedule(dynamic)\n    for (int y = 0; y < h0; y++) {\n        for (int x = 0; x < w1; x++) {\n            double s = 0;\n            for (int z = 0; z < w0; z++) {\n                s += mat0[y][z] * mat1[z][x];\n            }\n            out[y][x] = s;\n        }\n    }\n\n    return out;\n}\n\n\nint main()\n{\n    matrix a, b;\n    a = random_matrx(16, 9);\n    b = random_matrx(9, 24);\n    auto start = high_resolution_clock::now();\n    for (int64_t i = 0; i < 65536; i++) {\n        dot_product(a, b);\n    }\n    auto end = high_resolution_clock::now();\n    duration<double, std::nano> time = end - start;\n    double once = time.count() / 65536000;\n    cout << "mat(16, 9) * mat(9, 24): " + std::to_string(once) + " microseconds\\n";\n    a = random_matrx(128, 256);\n    b = random_matrx(256, 512);\n    start = high_resolution_clock::now();\n    for (int64_t i = 0; i < 512; i++) {\n        dot_product(a, b);\n    }\n    end = high_resolution_clock::now();\n    time = end - start;\n    once = time.count() / 512000;\n    cout << "mat(128, 256) * mat(256, 512): " + std::to_string(once) + " microseconds\\n";\n    start = high_resolution_clock::now();\n    for (int64_t i = 0; i < 512; i++) {\n        dot_product_omp(a, b);\n    }\n    end = high_resolution_clock::now();\n    time = end - start;\n    once = time.count() / 512000;\n    cout << "mat(128, 256) * mat(256, 512) omp: " + std::to_string(once) + " microseconds\\n";\n}\n

Run Code Online (Sandbox Code Playgroud)\n

PS D:\\MyScript> C:\\Users\\Xeni\\source\\repos\\matmul\\x64\\Release\\matmul.exe\nmat(16, 9) * mat(9, 24): 5.200116 microseconds\nmat(128, 256) * mat(256, 512): 30128.739453 microseconds\nmat(128, 256) * mat(256, 512) omp: 30116.103125 microseconds\n

Run Code Online (Sandbox Code Playgroud)\n

我使用 Visual Studio 2022、C++20 标准、编译器标志编译它：

\n

/permissive- /ifcOutput "x64\\Release\\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /sdl /Fd"x64\\Release\\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\\Release\\" /EHsc /nologo /Fo"x64\\Release\\" /Ot /Fp"x64\\Release\\matmul.pch" /diagnostics:column\n

Run Code Online (Sandbox Code Playgroud)\n

附加标志：

\n

/arch:AVX2 /fp:fast \n

Run Code Online (Sandbox Code Playgroud)\n

只是为什么没有任何改善呢？我怎样才能真正改善它？

\n

我已将 OMP 版本更改为：

\n

matrix dot_product_omp(const matrix& mat0, const matrix& mat1) {\n    size_t h0, w0, h1, w1;\n    h0 = mat0.size();\n    w0 = mat0[0].size();\n    h1 = mat1.size();\n    w1 = mat1[0].size();\n    if (w0 != h1) {\n        throw std::invalid_argument("matrices cannot be cross multiplied");\n    }\n\n    matrix out(h0, line(w1));\n    omp_set_num_threads(4);\n    #pragma omp parallel for schedule(dynamic)\n    for (int y = 0; y < h0; y++) {\n        for (int x = 0; x < w1; x++) {\n            double s = 0;\n            for (int z = 0; z < w0; z++) {\n                s += mat0[y][z] * mat1[z][x];\n            }\n            out[y][x] = s;\n        }\n    }\n\n    return out;\n}\n

Run Code Online (Sandbox Code Playgroud)\n

我用flag进行了编译/openmp，我已经进行了多次基准测试，它只使代码运行时间约为原始时间的四分之一：

\n

PS D:\\MyScript> C:\\Users\\Xeni\\source\\repos\\matmul\\x64\\Release\\matmul.exe\nmat(16, 9) * mat(9, 24): 5.126476 microseconds\nmat(128, 256) * mat(256, 512): 30999.137109 microseconds\nmat(128, 256) * mat(256, 512) omp: 8574.475195 microseconds\n

Run Code Online (Sandbox Code Playgroud)\n

NumPy 更快：

\n

In [374]: a = np.random.random((128, 256))\n\nIn [375]: b = np.random.random((256, 512))\n\nIn [376]: %timeit a @ b\n382 \xc2\xb5s \xc2\xb1 19.6 \xc2\xb5s per loop (mean \xc2\xb1 std. dev. of 7 runs, 1,000 loops each)\n

Run Code Online (Sandbox Code Playgroud)\n

我的代码慢了一个数量级。那么如何才能缩小性能差距呢？

\n

Answer 1

Jér*_*ard 7

代码中存在很多问题，但关键是内存访问模式效率低下，并且它阻止（几乎）任何矢量化。

访问mat1[z][x]效率低下，因为当z增加时，需要获取新的向量，然后x从内存中获取第 th 项。这会导致两次类似随机的内存读取。这种内存访问比顺序内存访问慢得多。更不用说大多数编译器不会对具有此类内存访问的循环进行矢量化，因为这几乎是不可能的（理论上这对于 SIMD 集合来说是可能的，但实际上效率很低）。最重要的是，缓存行的使用很差：仅使用了与相关的缓存行的 8/64 字节mat1，其余部分被浪费，因为缓存行将很快从缓存中逐出（导致缓存垃圾）。此类问题会导致应用程序在大多数平台上无法很好地扩展，因为它受到内存限制（使用更多内核并不会使 RAM 运行得更快）。您需要连续读取数据以获得更好的性能。这是一个更快的实现：

    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < h0; y++) {
        for (int x = 0; x < w1; x++) {
            out[y][x] += 0.0;
        }
        for (int z = 0; z < w0; z++) {
            for (int x = 0; x < w1; x++) {
                out[y][x] += mat0[y][z] * mat1[z][x];
            }
        }
    }

Run Code Online (Sandbox Code Playgroud)

Before:
mat(16, 9) * mat(9, 24): 2.931490 microseconds
mat(128, 256) * mat(256, 512): 14704.138781 microseconds
mat(128, 256) * mat(256, 512) omp: 4013.665295 microseconds

After:
mat(16, 9) * mat(9, 24): 0.931926 microseconds
mat(128, 256) * mat(256, 512): 3296.070098 microseconds
mat(128, 256) * mat(256, 512) omp: 1230.341350 microseconds

Run Code Online (Sandbox Code Playgroud)

顺序代码比以前快 4.3 倍，并行代码现在快 3.3 倍。现在应该对代码进行矢量化。

由于其他因素，该代码仍然相当低效，例如：

缓存未命中/损坏：矩阵被重新加载多次；
FMA/负载比小：CPU 花时间加载数据，同时可以花时间执行 FMA 指令；
内存间接寻址：vector<vector<double>>是一种存储矩阵效率非常低的数据结构，请使用展平数组（或 Eigen）；
schedule(dynamic)在大多数机器上效率低下，实际上不应该有用：如果代码被优化，工作应该在核心之间平衡。考虑使用schedule(static);
NUMA影响（尤其是在服务器和 AMD 处理器上）；
ETC。

正如评论中提到的，人们不应该期望达到 Numpy 的速度，因为它调用dgemmBLAS 原语，这在大多数机器上接近最佳（至少对于 OpenBLAS、BLIS 和 Intel MKL）。如果没有低级 SIMD 内在函数或汇编代码，很难获得类似的性能（大多数编译器会生成次优代码，由于寄存器分配不当而无法实现最佳性能）。

许多HPC 书籍和教程都解释了此类问题以及如何解决它们。有些具体解释了如何获得相对快速的矩阵乘法。我强烈鼓励您阅读它们。

请注意，这private(y, x, z)是无用的，因为变量是在循环中声明的。事实上，这甚至是不正确的（像 GCC 这样的编译器会打印错误）。另请注意，使用omp_set_num_threads(4);通常被认为是一种不好的做法。您应该修改环境变量OMP_NUM_THREADS。另外请注意，MSVC 支持的默认 OpenMP 版本非常旧且完全过时。应改用 Clang OpenMP 版本。事实上，据我所知 MSVC 不是一个好的优化编译器，因此应该使用 Clang/GCC（或任何 HPC 编译器，如 ICC、PGI 等）。

归档时间：	2 年，4 月前
查看次数：	226 次
最近记录：	2 年，4 月前