Key*_*ame 18 c memory performance c99
disclosure: I've tried similar question on programmers.stack, but that place is nowhere near activity stack is.
Intro
I tend to work with lots of large images. They also come in sequences of more than one and have to be processed and played back repeatedly. Sometimes I use GPU, sometimes CPU, sometimes both. Most of access patterns are linear in nature (back and forth) which got me thinking about more basic things regarding arrays and how should one approach writing code optimized for maximum memory bandwidth possible on given hardware (permitting computation isn't blocking read/write).
Test specs
-pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0带有额外的include和库标志以及框架标志,以便使用我倾向于使用的glfw计时器.我可以在没有的情况下完成它,没关系.当然,所有64位.-fprefetch-loop-arrays标志进行测试,但它似乎根本没有影响结果测试
n bytes在堆上- ,其中n是8, 16, 32, 64, 128, 256, 512 and 1024 MBarray为0xff,字节线性副本:
for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
array_copy[i] = array[i];
}
Run Code Online (Sandbox Code Playgroud)
malloc在c99中uint64_t会给我内存对齐块.我还看到我的L1到L3缓存的大小,这些都高于这些320 bytes,所以我打的是什么?线索可能会在图表中稍后出现.我真的很想了解这一点.大步复制:
for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
array_copy[i] = array[i];
array_copy[i+1] = array[i+1];
array_copy[i+2] = array[i+2];
array_copy[i+3] = array[i+3];
array_copy[i+4] = array[i+4];
array_copy[i+5] = array[i+5];
array_copy[i+6] = array[i+6];
array_copy[i+7] = array[i+7];
array_copy[i+8] = array[i+8];
array_copy[i+9] = array[i+9];
array_copy[i+10] = array[i+10];
array_copy[i+11] = array[i+11];
array_copy[i+12] = array[i+12];
array_copy[i+13] = array[i+13];
array_copy[i+14] = array[i+14];
array_copy[i+15] = array[i+15];
array_copy[i+16] = array[i+16];
array_copy[i+17] = array[i+17];
array_copy[i+18] = array[i+18];
array_copy[i+19] = array[i+19];
array_copy[i+20] = array[i+20];
array_copy[i+21] = array[i+21];
array_copy[i+22] = array[i+22];
array_copy[i+23] = array[i+23];
array_copy[i+24] = array[i+24];
array_copy[i+25] = array[i+25];
array_copy[i+26] = array[i+26];
array_copy[i+27] = array[i+27];
array_copy[i+28] = array[i+28];
array_copy[i+29] = array[i+29];
array_copy[i+30] = array[i+30];
array_copy[i+31] = array[i+31];
array_copy[i+32] = array[i+32];
array_copy[i+33] = array[i+33];
array_copy[i+34] = array[i+34];
array_copy[i+35] = array[i+35];
array_copy[i+36] = array[i+36];
array_copy[i+37] = array[i+37];
array_copy[i+38] = array[i+38];
array_copy[i+39] = array[i+39];
}
Run Code Online (Sandbox Code Playgroud)
大步阅读:
const int imax = 1000;
for(int j = 0; j < imax; ++j) {
uint64_t tmp = 0;
performance = 0;
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
tmp = array[i];
tmp = array[i+1];
tmp = array[i+2];
tmp = array[i+3];
tmp = array[i+4];
tmp = array[i+5];
tmp = array[i+6];
tmp = array[i+7];
tmp = array[i+8];
tmp = array[i+9];
tmp = array[i+10];
tmp = array[i+11];
tmp = array[i+12];
tmp = array[i+13];
tmp = array[i+14];
tmp = array[i+15];
tmp = array[i+16];
tmp = array[i+17];
tmp = array[i+18];
tmp = array[i+19];
tmp = array[i+20];
tmp = array[i+21];
tmp = array[i+22];
tmp = array[i+23];
tmp = array[i+24];
tmp = array[i+25];
tmp = array[i+26];
tmp = array[i+27];
tmp = array[i+28];
tmp = array[i+29];
tmp = array[i+30];
tmp = array[i+31];
tmp = array[i+32];
tmp = array[i+33];
tmp = array[i+34];
tmp = array[i+35];
tmp = array[i+36];
tmp = array[i+37];
tmp = array[i+38];
tmp = array[i+39];
}
Run Code Online (Sandbox Code Playgroud)
-fprefetch-loop-arrays在这里没有结果.我认为这是针对这些案件的.线性阅读:
for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
tmp = array[i];
}
Run Code Online (Sandbox Code Playgroud)
memcpy作为对比.的memcpy:
memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
Run Code Online (Sandbox Code Playgroud)
结果
样本输出:
Init done in 0.767 s - size of array: 1024 MBs (x2)
Performance: 1304.325 MB/s
Copying (linear) done in 0.898 s
Performance: 1113.529 MB/s
Copying (stride 40) done in 0.257 s
Performance: 3890.608 MB/s
[1000/1000] Performance stride 40: 7474.322 MB/s
Average: 7523.427 MB/s
Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s
[1000/1000] Performance dumb: 2504.713 MB/s
Average: 2481.502 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s
Copying (memcpy) done in 1.726 s
Performance: 579.485 MB/s
--
Init done in 0.415 s - size of array: 512 MBs (x2)
Performance: 1233.136 MB/s
Copying (linear) done in 0.442 s
Performance: 1157.147 MB/s
Copying (stride 40) done in 0.116 s
Performance: 4399.606 MB/s
[1000/1000] Performance stride 40: 6527.004 MB/s
Average: 7166.458 MB/s
Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s
[1000/1000] Performance dumb: 2383.292 MB/s
Average: 2409.005 MB/s
Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s
Copying (memcpy) done in 0.102 s
Performance: 5026.476 MB/s
--
Init done in 0.228 s - size of array: 256 MBs (x2)
Performance: 1124.618 MB/s
Copying (linear) done in 0.242 s
Performance: 1057.916 MB/s
Copying (stride 40) done in 0.070 s
Performance: 3650.996 MB/s
[1000/1000] Performance stride 40: 7129.206 MB/s
Average: 7370.537 MB/s
Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s
[1000/1000] Performance dumb: 2456.129 MB/s
Average: 2435.556 MB/s
Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s
Copying (memcpy) done in 0.050 s
Performance: 5095.845 MB/s
--
Init done in 0.100 s - size of array: 128 MBs (x2)
Performance: 1277.200 MB/s
Copying (linear) done in 0.112 s
Performance: 1147.030 MB/s
Copying (stride 40) done in 0.029 s
Performance: 4424.513 MB/s
[1000/1000] Performance stride 40: 6497.635 MB/s
Average: 6714.540 MB/s
Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s
[1000/1000] Performance dumb: 2275.336 MB/s
Average: 2335.544 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s
Copying (memcpy) done in 0.025 s
Performance: 5086.502 MB/s
--
Init done in 0.051 s - size of array: 64 MBs (x2)
Performance: 1255.969 MB/s
Copying (linear) done in 0.058 s
Performance: 1104.282 MB/s
Copying (stride 40) done in 0.015 s
Performance: 4305.765 MB/s
[1000/1000] Performance stride 40: 7750.063 MB/s
Average: 7412.167 MB/s
Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s
[1000/1000] Performance dumb: 2610.136 MB/s
Average: 2577.313 MB/s
Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s
Copying (memcpy) done in 0.013 s
Performance: 4871.823 MB/s
--
Init done in 0.024 s - size of array: 32 MBs (x2)
Performance: 1306.738 MB/s
Copying (linear) done in 0.028 s
Performance: 1148.582 MB/s
Copying (stride 40) done in 0.008 s
Performance: 4265.907 MB/s
[1000/1000] Performance stride 40: 6181.040 MB/s
Average: 7124.592 MB/s
Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s
[1000/1000] Performance dumb: 2508.669 MB/s
Average: 2556.529 MB/s
Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s
Copying (memcpy) done in 0.007 s
Performance: 4617.860 MB/s
--
Init done in 0.013 s - size of array: 16 MBs (x2)
Performance: 1243.011 MB/s
Copying (linear) done in 0.014 s
Performance: 1139.362 MB/s
Copying (stride 40) done in 0.004 s
Performance: 4181.548 MB/s
[1000/1000] Performance stride 40: 6317.129 MB/s
Average: 7358.539 MB/s
Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s
[1000/1000] Performance dumb: 2529.707 MB/s
Average: 2525.783 MB/s
Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s
Copying (memcpy) done in 0.003 s
Performance: 5167.561 MB/s
--
Init done in 0.007 s - size of array: 8 MBs (x2)
Performance: 1186.019 MB/s
Copying (linear) done in 0.007 s
Performance: 1147.018 MB/s
Copying (stride 40) done in 0.002 s
Performance: 4157.658 MB/s
[1000/1000] Performance stride 40: 6958.839 MB/s
Average: 7097.742 MB/s
Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s
[1000/1000] Performance dumb: 2585.366 MB/s
Average: 2537.896 MB/s
Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s
Copying (memcpy) done in 0.002 s
Performance: 5059.164 MB/s
Run Code Online (Sandbox Code Playgroud)
10,664 MB/s这样,为什么我没有击中它?为什么阅读速度不一致,我将如何优化(缓存未命中?)?从图表中可以看出更为明显,尤其是线性读数,性能经常下降.图表
以下是感兴趣的人的完整资料来源:
/*
gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo
optional: -fprefetch-loop-arrays
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <inttypes.h>
#include <GLFW/glfw3.h>
#define ARRAY_NUM 1000000 * 128 /* GIG */
int main(int argc, char *argv[]) {
if(!glfwInit()) {
exit(EXIT_FAILURE);
}
int cx = 0;
char filename_stride[50];
char filename_dumb[50];
cx = snprintf(filename_stride, 50, "%lu_stride.dat",
((ARRAY_NUM*sizeof(uint64_t))/1000000));
if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
FILE *file_stride = fopen(filename_stride, "w");
cx = snprintf(filename_dumb, 50, "%lu_dumb.dat",
((ARRAY_NUM*sizeof(uint64_t))/1000000));
if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
FILE *file_dumb = fopen(filename_dumb, "w");
if(file_stride == NULL || file_dumb == NULL) {
perror("Error opening file.");
exit(EXIT_FAILURE);
}
uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);
double performance = 0.0;
double time_start = 0.0;
double time_end = 0.0;
double performance_min = 0.0;
double performance_max = 0.0;
/* Init array */
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
array[i] = 0xff;
}
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
printf("Performance: %.3f MB/s\n\n", performance);
/* Linear copy */
performance = 0;
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
array_copy[i] = array[i];
}
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
printf("Performance: %.3f MB/s\n\n", performance);
/* Copying with wide stride */
performance = 0;
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
array_copy[i] = array[i];
array_copy[i+1] = array[i+1];
array_copy[i+2] = array[i+2];
array_copy[i+3] = array[i+3];
array_copy[i+4] = array[i+4];
array_copy[i+5] = array[i+5];
array_copy[i+6] = array[i+6];
array_copy[i+7] = array[i+7];
array_copy[i+8] = array[i+8];
array_copy[i+9] = array[i+9];
array_copy[i+10] = array[i+10];
array_copy[i+11] = array[i+11];
array_copy[i+12] = array[i+12];
array_copy[i+13] = array[i+13];
array_copy[i+14] = array[i+14];
array_copy[i+15] = array[i+15];
array_copy[i+16] = array[i+16];
array_copy[i+17] = array[i+17];
array_copy[i+18] = array[i+18];
array_copy[i+19] = array[i+19];
array_copy[i+20] = array[i+20];
array_copy[i+21] = array[i+21];
array_copy[i+22] = array[i+22];
array_copy[i+23] = array[i+23];
array_copy[i+24] = array[i+24];
array_copy[i+25] = array[i+25];
array_copy[i+26] = array[i+26];
array_copy[i+27] = array[i+27];
array_copy[i+28] = array[i+28];
array_copy[i+29] = array[i+29];
array_copy[i+30] = array[i+30];
array_copy[i+31] = array[i+31];
array_copy[i+32] = array[i+32];
array_copy[i+33] = array[i+33];
array_copy[i+34] = array[i+34];
array_copy[i+35] = array[i+35];
array_copy[i+36] = array[i+36];
array_copy[i+37] = array[i+37];
array_copy[i+38] = array[i+38];
array_copy[i+39] = array[i+39];
}
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
printf("Performance: %.3f MB/s\n\n", performance);
/* Reading with wide stride */
const int imax = 1000;
double performance_average = 0.0;
for(int j = 0; j < imax; ++j) {
uint64_t tmp = 0;
performance = 0;
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
tmp = array[i];
tmp = array[i+1];
tmp = array[i+2];
tmp = array[i+3];
tmp = array[i+4];
tmp = array[i+5];
tmp = array[i+6];
tmp = array[i+7];
tmp = array[i+8];
tmp = array[i+9];
tmp = array[i+10];
tmp = array[i+11];
tmp = array[i+12];
tmp = array[i+13];
tmp = array[i+14];
tmp = array[i+15];
tmp = array[i+16];
tmp = array[i+17];
tmp = array[i+18];
tmp = array[i+19];
tmp = array[i+20];
tmp = array[i+21];
tmp = array[i+22];
tmp = array[i+23];
tmp = array[i+24];
tmp = array[i+25];
tmp = array[i+26];
tmp = array[i+27];
tmp = array[i+28];
tmp = array[i+29];
tmp = array[i+30];
tmp = array[i+31];
tmp = array[i+32];
tmp = array[i+33];
tmp = array[i+34];
tmp = array[i+35];
tmp = array[i+36];
tmp = array[i+37];
tmp = array[i+38];
tmp = array[i+39];
}
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
performance_average += performance;
if(performance > performance_max) { performance_max = performance; }
if(j == 0) { performance_min = performance; }
if(performance < performance_min) { performance_min = performance; }
printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
fprintf(file_stride, "%d\t%f\n", j, performance);
fflush(file_stride);
fflush(stdout);
}
performance_average = performance_average / imax;
printf("\nAverage: %.3f MB/s\n", performance_average);
printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n",
performance_min, performance_max);
/* Linear reading */
performance_average = 0.0;
performance_min = 0.0;
performance_max = 0.0;
for(int j = 0; j < imax; ++j) {
uint64_t tmp = 0;
performance = 0;
time_start = glfwGetTime();
for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
tmp = array[i];
}
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
performance_average += performance;
if(performance > performance_max) { performance_max = performance; }
if(j == 0) { performance_min = performance; }
if(performance < performance_min) { performance_min = performance; }
printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
fprintf(file_dumb, "%d\t%f\n", j, performance);
fflush(file_dumb);
fflush(stdout);
}
performance_average = performance_average / imax;
printf("\nAverage: %.3f MB/s\n", performance_average);
printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n",
performance_min, performance_max);
/* Memcpy */
performance = 0;
time_start = glfwGetTime();
memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
time_end = glfwGetTime();
performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
printf("Performance: %.3f MB/s\n", performance);
/* Cleanup and exit */
free(array);
free(array_copy);
glfwTerminate();
fclose(file_dumb);
fclose(file_stride);
exit(EXIT_SUCCESS);
}
Run Code Online (Sandbox Code Playgroud)
摘要
-funroll-loops to no results, so I have resorted to manually writing loop-in-loop unrolls.Thanks for the longs read.
EDIT:
It seems -O0 gives different performance from when -O flag is absent! What gives? Flag being absent yields better performance, as can be seen in the graph.
EDIT2:
I have finally hit the ceiling with AVX.
=== READING WITH AVX ===
[1000/1000] Performance AVX: 9868.912 MB/s
Average: 10029.085 MB/s
Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s
Run Code Online (Sandbox Code Playgroud)
Average being really close to 10664. I had to change compiler to clang because gcc was giving me a hard time for using avx (-mavx). This is also why graph has more pronounced dips. I would still like to know how to/what is/have constant performance. I presume this is due to caching/cache lines. It would also explain performance going above DDR3 speed here and there (MAX was 11464 MB/s).
Excuse my gnuplot-fu and its keys. Blue is SSE2 ( _mm_load_si128 ) and orange is AVX ( _mm256_load_si256 ). Purple is strided as before and green is dumb reading one at a time.
So, final two questions are:
gist with latest version: https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b and graphs from that version: http://imgur.com/a/cPeor
您从主存储器获得的峰值带宽值减少了两倍.而不是它10664 MB/s 应该是21.3 GB/s(更确切地说它应该是(21333⅓)MB/s - 请参阅下面的推导).您看到超过10664 MB/s的事实有时应该告诉您,您的峰值带宽计算可能存在问题.
为了通过Sandy Bridge获得Core2的最大带宽,您需要使用非临时存储.此外,您需要多个线程.您不需要AVX指令或展开循环.
void copy(char *x, char *y, int n)
{
#pragma omp parallel for schedule(static)
for(int i=0; i<n/16; i++)
{
_mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
}
}
Run Code Online (Sandbox Code Playgroud)
数组需要16字节对齐,也是16的倍数.非临时存储的经验法则是当您复制的内存大于最后一级缓存大小的一半时使用它们.在您的情况下,L3缓存大小的一半是1.5 MB,您复制的最小阵列是8 MB,因此这远大于最后一级缓存大小的一半.
这是一些测试它的代码.
//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>
void copy(char *x, char *y, int n)
{
#pragma omp parallel for schedule(static)
for(int i=0; i<n/16; i++)
{
_mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
}
}
void copy2(char *x, char *y, int n)
{
#pragma omp parallel for schedule(static)
for(int i=0; i<n/16; i++)
{
_mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
}
}
int main(void)
{
unsigned n = 0x7fffffff;
char *x = _mm_malloc(n, 16);
char *y = _mm_malloc(n, 16);
double dtime;
memset(x,0,n);
memset(y,1,n);
dtime = -omp_get_wtime();
copy(x,y,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
copy2(x,y,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
memcpy(x,y,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
}
Run Code Online (Sandbox Code Playgroud)
在我的系统上,Core2(在Nehalem之前)P9600 @2.53GHz,它给出了
time non temporal store 0.39
time SSE store 1.10
time memcpy 0.98
Run Code Online (Sandbox Code Playgroud)
复制2GB.
请注意,首先"触摸"要写入的内存非常重要(我使用memset执行此操作).在您访问它之前,您的系统不一定会分配内存.如果在执行内存复制时未访问内存,则执行此操作的开销会显着偏差.
据维基百科称, DDR3-1333的内存时钟为166⅔MHz.DDR以两倍的内存时钟速率传输数据.此外,DDR3的总线时钟倍频为4.因此DDR3的每存储器时钟总乘数为8.此外,您的主板有两个内存通道.所以总转移率是
21333? MB/s = (166? 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).
Run Code Online (Sandbox Code Playgroud)