Optimizing linear access to arrays with pre-fetching and cache in C

Key*_*ame 18 c memory performance c99

disclosure: I've tried similar question on programmers.stack, but that place is nowhere near activity stack is.

Intro

I tend to work with lots of large images. They also come in sequences of more than one and have to be processed and played back repeatedly. Sometimes I use GPU, sometimes CPU, sometimes both. Most of access patterns are linear in nature (back and forth) which got me thinking about more basic things regarding arrays and how should one approach writing code optimized for maximum memory bandwidth possible on given hardware (permitting computation isn't blocking read/write).

Test specs

  • I've done this on a 2011 MacbookAir4,2 (I5-2557M) with 4GB RAM and SSD. Nothing else was running during tests except iterm2.
  • 带有标志的gcc 5.2.0(自制软件):-pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0带有额外的include和库标志以及框架标志,以便使用我倾向于使用的glfw计时器.我可以在没有的情况下完成它,没关系.当然,所有64位.
  • 我已尝试使用可选-fprefetch-loop-arrays标志进行测试,但它似乎根本没有影响结果

测试

  • 分配的两个阵列n bytes在堆上- ,其中n8, 16, 32, 64, 128, 256, 512 and 1024 MB
  • 一次初始化array0xff,字节
  • 测试1 - 线性副本

线性副本:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
Run Code Online (Sandbox Code Playgroud)
  • 测试2 - 大步复制.这是令人困惑的地方.我试过在这里玩预取游戏.我已经尝试了每个循环应该做多少的各种组合,似乎每循环~40可以产生最佳性能.为什么?我不知道.我明白malloc在c99中uint64_t会给我内存对齐块.我还看到我的L1到L3缓存的大小,这些都高于这些320 bytes,所以我打的是什么?线索可能会在图表中稍后出现.我真的很想了解这一点.

大步复制:

for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
Run Code Online (Sandbox Code Playgroud)
  • 测试3 - 大步读书.与使用步幅复制相同.

大步阅读:

    const int imax = 1000;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
Run Code Online (Sandbox Code Playgroud)
  • 测试4 - 线性读数.每字节字节数.我很惊讶-fprefetch-loop-arrays在这里没有结果.我认为这是针对这些案件的.

线性阅读:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
Run Code Online (Sandbox Code Playgroud)
  • 测试5 - memcpy作为对比.

的memcpy:

memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
Run Code Online (Sandbox Code Playgroud)

结果

  • 样本输出:

样本输出:

Init done in 0.767 s - size of array: 1024 MBs (x2)
Performance: 1304.325 MB/s

Copying (linear) done in 0.898 s
Performance: 1113.529 MB/s

Copying (stride 40) done in 0.257 s
Performance: 3890.608 MB/s

[1000/1000] Performance stride 40: 7474.322 MB/s
Average: 7523.427 MB/s
Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s

[1000/1000] Performance dumb: 2504.713 MB/s
Average: 2481.502 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s

Copying (memcpy) done in 1.726 s
Performance: 579.485 MB/s

--

Init done in 0.415 s - size of array: 512 MBs (x2)
Performance: 1233.136 MB/s

Copying (linear) done in 0.442 s
Performance: 1157.147 MB/s

Copying (stride 40) done in 0.116 s
Performance: 4399.606 MB/s

[1000/1000] Performance stride 40: 6527.004 MB/s
Average: 7166.458 MB/s
Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s

[1000/1000] Performance dumb: 2383.292 MB/s
Average: 2409.005 MB/s
Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s

Copying (memcpy) done in 0.102 s
Performance: 5026.476 MB/s

--

Init done in 0.228 s - size of array: 256 MBs (x2)
Performance: 1124.618 MB/s

Copying (linear) done in 0.242 s
Performance: 1057.916 MB/s

Copying (stride 40) done in 0.070 s
Performance: 3650.996 MB/s

[1000/1000] Performance stride 40: 7129.206 MB/s
Average: 7370.537 MB/s
Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s

[1000/1000] Performance dumb: 2456.129 MB/s
Average: 2435.556 MB/s
Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s

Copying (memcpy) done in 0.050 s
Performance: 5095.845 MB/s

-- 

Init done in 0.100 s - size of array: 128 MBs (x2)
Performance: 1277.200 MB/s

Copying (linear) done in 0.112 s
Performance: 1147.030 MB/s

Copying (stride 40) done in 0.029 s
Performance: 4424.513 MB/s

[1000/1000] Performance stride 40: 6497.635 MB/s
Average: 6714.540 MB/s
Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s

[1000/1000] Performance dumb: 2275.336 MB/s
Average: 2335.544 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s

Copying (memcpy) done in 0.025 s
Performance: 5086.502 MB/s

-- 

Init done in 0.051 s - size of array: 64 MBs (x2)
Performance: 1255.969 MB/s

Copying (linear) done in 0.058 s
Performance: 1104.282 MB/s

Copying (stride 40) done in 0.015 s
Performance: 4305.765 MB/s

[1000/1000] Performance stride 40: 7750.063 MB/s
Average: 7412.167 MB/s
Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s

[1000/1000] Performance dumb: 2610.136 MB/s
Average: 2577.313 MB/s
Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s

Copying (memcpy) done in 0.013 s
Performance: 4871.823 MB/s

-- 

Init done in 0.024 s - size of array: 32 MBs (x2)
Performance: 1306.738 MB/s

Copying (linear) done in 0.028 s
Performance: 1148.582 MB/s

Copying (stride 40) done in 0.008 s
Performance: 4265.907 MB/s

[1000/1000] Performance stride 40: 6181.040 MB/s
Average: 7124.592 MB/s
Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s

[1000/1000] Performance dumb: 2508.669 MB/s
Average: 2556.529 MB/s
Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s

Copying (memcpy) done in 0.007 s
Performance: 4617.860 MB/s

--

Init done in 0.013 s - size of array: 16 MBs (x2)
Performance: 1243.011 MB/s

Copying (linear) done in 0.014 s
Performance: 1139.362 MB/s

Copying (stride 40) done in 0.004 s
Performance: 4181.548 MB/s

[1000/1000] Performance stride 40: 6317.129 MB/s
Average: 7358.539 MB/s
Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s

[1000/1000] Performance dumb: 2529.707 MB/s
Average: 2525.783 MB/s
Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s

Copying (memcpy) done in 0.003 s
Performance: 5167.561 MB/s

--

Init done in 0.007 s - size of array: 8 MBs (x2)
Performance: 1186.019 MB/s

Copying (linear) done in 0.007 s
Performance: 1147.018 MB/s

Copying (stride 40) done in 0.002 s
Performance: 4157.658 MB/s

[1000/1000] Performance stride 40: 6958.839 MB/s
Average: 7097.742 MB/s
Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s

[1000/1000] Performance dumb: 2585.366 MB/s
Average: 2537.896 MB/s
Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s

Copying (memcpy) done in 0.002 s
Performance: 5059.164 MB/s
Run Code Online (Sandbox Code Playgroud)
  • 线性读数比步幅读数慢3倍.步幅读数达到最大值.7500-7800 MB/s范围.但有两件事令我困惑.在DDR3 1333 Mhz,最大内存吞吐量应该是10,664 MB/s这样,为什么我没有击中它?为什么阅读速度不一致,我将如何优化(缓存未命中?)?从图表中可以看出更为明显,尤其是线性读数,性能经常下降.

图表

8-16 MB 8-16 MB

32-64 MB 32-64 MB

128-256 MB 128-256 MB

512-1024 MB 512-1024 MB

全部一起 ALL

以下是感兴趣的人的完整资料来源:

/*
gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo

optional: -fprefetch-loop-arrays
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM 1000000 * 128 /* GIG */
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    int cx = 0;
    char filename_stride[50];
    char filename_dumb[50];
    cx = snprintf(filename_stride, 50, "%lu_stride.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_stride = fopen(filename_stride, "w");
    cx = snprintf(filename_dumb, 50, "%lu_dumb.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_dumb   = fopen(filename_dumb, "w");
    if(file_stride == NULL || file_dumb == NULL) {
        perror("Error opening file.");
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);

    double performance  = 0.0;
    double time_start   = 0.0;
    double time_end     = 0.0;
    double performance_min  = 0.0;
    double performance_max  = 0.0;

    /* Init array */
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Linear copy */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Copying with wide stride */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Reading with wide stride */
    const int imax = 1000;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }

        printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_stride, "%d\t%f\n", j, performance);
        fflush(file_stride);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Linear reading */
    performance_average = 0.0;
    performance_min     = 0.0;
    performance_max      = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }
        printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_dumb, "%d\t%f\n", j, performance);
        fflush(file_dumb);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Memcpy */
    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n", performance);

    /* Cleanup and exit */
    free(array);
    free(array_copy);
    glfwTerminate();
    fclose(file_dumb);
    fclose(file_stride);

    exit(EXIT_SUCCESS);
}
Run Code Online (Sandbox Code Playgroud)

摘要

  • 在使用线性访问是最常见模式的数组时,我应该如何编写代码以获得最大和(接近)恒定速度?
  • 我可以从这个例子中了解缓存和预取的哪些内容?
  • Are these graphs telling me something I should know that I haven't noticed?
  • How else can I unroll loops? I have tried -funroll-loops to no results, so I have resorted to manually writing loop-in-loop unrolls.

Thanks for the longs read.

EDIT:

It seems -O0 gives different performance from when -O flag is absent! What gives? Flag being absent yields better performance, as can be seen in the graph.

O flag absent

EDIT2:

I have finally hit the ceiling with AVX.

=== READING WITH AVX ===
[1000/1000] Performance AVX: 9868.912 MB/s
Average: 10029.085 MB/s
Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s
Run Code Online (Sandbox Code Playgroud)

Average being really close to 10664. I had to change compiler to clang because gcc was giving me a hard time for using avx (-mavx). This is also why graph has more pronounced dips. I would still like to know how to/what is/have constant performance. I presume this is due to caching/cache lines. It would also explain performance going above DDR3 speed here and there (MAX was 11464 MB/s).

Excuse my gnuplot-fu and its keys. Blue is SSE2 ( _mm_load_si128 ) and orange is AVX ( _mm256_load_si256 ). Purple is strided as before and green is dumb reading one at a time.

King AVX

So, final two questions are:

  • What is causing dips and how to have more constant performance
  • Is it possible to hit ceiling without intrinsics?

gist with latest version: https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b and graphs from that version: http://imgur.com/a/cPeor

Z b*_*son 7

您从主存储器获得的峰值带宽值减少了两倍.而不是它10664 MB/s 应该是21.3 GB/s(更确切地说它应该是(21333⅓)MB/s - 请参阅下面的推导).您看到超过10664 MB/s的事实有时应该告诉您,您的峰值带宽计算可能存在问题.

为了通过Sandy Bridge获得Core2的最大带宽,您需要使用非临时存储.此外,您需要多个线程.您不需要AVX指令或展开循环.

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}
Run Code Online (Sandbox Code Playgroud)

数组需要16字节对齐,也是16的倍数.非临时存储的经验法则是当您复制的内存大于最后一级缓存大小的一半时使用它们.在您的情况下,L3缓存大小的一半是1.5 MB,您复制的最小阵列是8 MB,因此这远大于最后一级缓存大小的一半.

这是一些测试它的代码.

//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

void copy2(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

int main(void)
{
    unsigned n = 0x7fffffff;
    char *x = _mm_malloc(n, 16);
    char *y = _mm_malloc(n, 16);
    double dtime;

    memset(x,0,n);
    memset(y,1,n);

    dtime = -omp_get_wtime();
    copy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    copy2(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    memcpy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);  
}
Run Code Online (Sandbox Code Playgroud)

在我的系统上,Core2(在Nehalem之前)P9600 @2.53GHz,它给出了

time non temporal store 0.39
time SSE store          1.10
time memcpy             0.98
Run Code Online (Sandbox Code Playgroud)

复制2GB.

请注意,首先"触摸"要写入的内存非常重要(我使用memset执行此操作).在您访问它之前,您的系统不一定会分配内存.如果在执行内存复制时未访问内存,则执行此操作的开销会显着偏差.


据维基百科称, DDR3-1333的内存时钟为166⅔MHz.DDR以两倍的内存时钟速率传输数据.此外,DDR3的总线时钟倍频为4.因此DDR3的每存储器时钟总乘数为8.此外,您的主板有两个内存通道.所以总转移率是

 21333? MB/s = (166? 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).
Run Code Online (Sandbox Code Playgroud)