I want to achieve the maximum bandwidth for the following operation on Intel processors.
for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048
where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge, and Westmere systems.
I originally allocated the memory like this:
char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;
When I did this, I got about 50% of the peak bandwidth I expected for each system.
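For reference, a minimal sketch of how the achieved bandwidth could be measured, not the exact harness used for the numbers above. The function name `measure_add_bandwidth` and the `repeat` parameter are hypothetical, `posix_memalign` stands in for `_mm_malloc` so the sketch needs no intrinsics header, and the empty `__asm__` statement assumes GCC or Clang. Each pass is counted as two array reads plus one array write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* z[i] = x[i] + y[i]: two 4-byte loads and one 4-byte store per element. */
static void vec_add(const float *x, const float *y, float *z, int n) {
    for (int i = 0; i < n; i++) z[i] = x[i] + y[i];
}

/* Hypothetical helper: returns the achieved bandwidth in GB/s averaged over
   `repeat` passes, or -1.0 if an allocation fails. */
double measure_add_bandwidth(int n, int repeat) {
    assert(n > 0 && repeat > 0);
    size_t bytes = sizeof(float) * (size_t)n;
    void *px = NULL, *py = NULL, *pz = NULL;

    /* Assumption: posix_memalign replaces _mm_malloc; both give 64-byte alignment. */
    if (posix_memalign(&px, 64, bytes) || posix_memalign(&py, 64, bytes) ||
        posix_memalign(&pz, 64, bytes)) {
        free(px); free(py); free(pz);
        return -1.0;
    }
    float *x = px, *y = py, *z = pz;
    for (int i = 0; i < n; i++) { x[i] = (float)i; y[i] = 1.0f; z[i] = 0.0f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repeat; r++) {
        vec_add(x, y, z, n);
        /* Compiler barrier (GCC/Clang syntax) so repeated passes are not optimized away. */
        __asm__ volatile("" : : "r"(z) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* x and y are read, z is written: three arrays of traffic per pass. */
    double total_bytes = 3.0 * (double)bytes * (double)repeat;

    free(px); free(py); free(pz);
    return total_bytes / seconds / 1e9;
}
```

Calling `measure_add_bandwidth(2048, 100000)` matches n=2048 from the loop above, where the three arrays total 24 KB. The result can then be divided by the expected peak to get a fraction like the 50% quoted here.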
The peak is calculated as frequency * average bytes/clock_cycle. The average bytes/clock cycle for each system is:
Core2: two 16 byte reads and one 16 byte write per 2 clock cycles -> 24 bytes/clock cycle
SB/IB: two 32 byte reads and one 32 byte write per 2 clock cycles -> …
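The peak formula can be written out as a short sketch. The helper names `avg_bytes_per_cycle` and `peak_bandwidth_gbs` are hypothetical, and the clock frequency is whatever the machine under test runs at. With the frequency in GHz, the result comes out in GB/s.

```c
#include <assert.h>

/* Average bytes moved per clock cycle, given how many loads and stores of
   `width_bytes` each can be issued every `cycles` clock cycles. */
double avg_bytes_per_cycle(int loads, int stores, int width_bytes, int cycles) {
    assert(cycles > 0);
    return (double)((loads + stores) * width_bytes) / (double)cycles;
}

/* peak (GB/s) = frequency (GHz) * average bytes per clock cycle */
double peak_bandwidth_gbs(double freq_ghz, double bytes_per_cycle) {
    assert(freq_ghz > 0.0 && bytes_per_cycle > 0.0);
    return freq_ghz * bytes_per_cycle;
}
```

For the Core2 line, `avg_bytes_per_cycle(2, 1, 16, 2)` gives 24. At an assumed 2.5 GHz clock (an illustrative value, not one of the systems above), `peak_bandwidth_gbs(2.5, 24.0)` gives 60 GB/s.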