Related questions (0)

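How does the CPU cache affect the performance of a C program?

I am trying to learn more about how the CPU cache affects performance. As a simple test, I sum the values in the first column of a matrix while varying the total number of columns.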

// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double …

c performance cpu-cache

Score: 16 · Answers: 1 · Views: 1144

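What makes numpy.sum faster than an optimized (auto-vectorized) C loop?

I am trying to write a C program that sums an array of doubles as fast as numpy.sum does, but I seem to be failing.

Here is how I measure numpy's performance: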

import numpy as np
import time

SIZE=4000000
REPS=5

xs = np.random.rand(SIZE)
print(xs.dtype)

for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")

The output is:

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Now trying to do something similar in C:


Compiled with gcc -o main …

c floating-point numpy avx compiler-optimization

Score: 1 · Answers: 1 · Views: 155