Posts by Mar*_*ron

SIMD 2D matrix with Intel instruction sets

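I am developing high-performance algorithms based on the Intel instruction sets (AVX, FMA, etc.). My algorithms (kernels) work well when the data is stored sequentially. However, I am now facing a big problem for which I have found neither a workaround nor a solution: consider the following 2D matrix.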

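enum { x = 4096, y = 4096 };  /* compile-time sizes, so the arrays are not VLAs */
int i, j;
/* 64 MiB of floats: static storage avoids overflowing the stack */
static float data[x*y] __attribute__((aligned(32)));
static float buffer[y] __attribute__((aligned(32)));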

/* simple test data */ 
for (i = 0; i < x; i++)
    for (j = 0; j < y; j++)
        data[y*i+j] = y*i+j; // 0,1,2,3...4095, | 4096,4097, ... 8191 |...

/* 1) Extract the columns out of matrix */
__m256i vindex; __m256 vec;
vindex = _mm256_set_epi32(7*y, 6*y, 5*y, 4*y, 3*y, 2*y, y, 0);

for (i = 0; i < x; i += 8)
{
    vec = _mm256_i32gather_ps (&data[i*y], …

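The snippet above is cut off inside the gather call, so the rest of the loop is unknown. As a rough illustration of the column-extraction step it appears to start, here is a minimal sketch under stated assumptions: the function names (extract_column, extract_column_avx2, extract_column_scalar) and the HAVE_X86_GNU macro are hypothetical and not from the post, the matrix is row-major, and a GCC- or Clang-compatible compiler is assumed. It gathers eight rows of one column per iteration and falls back to a scalar loop when AVX2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_GNU 1   /* hypothetical helper macro for this sketch */
#endif

/* Scalar reference: copy column `col` of a row-major rows x cols matrix. */
static void extract_column_scalar(const float *data, int rows, int cols,
                                  int col, float *out)
{
    for (int r = 0; r < rows; r++)
        out[r] = data[(size_t)r * cols + col];
}

#ifdef HAVE_X86_GNU
/* AVX2 variant: one gather loads the same column from 8 consecutive rows.
   Assumes rows is a multiple of 8 and 7*cols fits in a 32-bit int. */
__attribute__((target("avx2")))
static void extract_column_avx2(const float *data, int rows, int cols,
                                int col, float *out)
{
    assert(rows % 8 == 0);
    /* Element offsets of rows 0..7; _mm256_setr_epi32 lists lane 0 first. */
    const __m256i vindex = _mm256_setr_epi32(0, cols, 2 * cols, 3 * cols,
                                             4 * cols, 5 * cols, 6 * cols,
                                             7 * cols);
    for (int r = 0; r < rows; r += 8) {
        const float *base = data + (size_t)r * cols + col;
        /* scale = 4 bytes, i.e. sizeof(float), turns offsets into bytes */
        __m256 v = _mm256_i32gather_ps(base, vindex, 4);
        _mm256_storeu_ps(out + r, v);
    }
}
#endif

/* Dispatch: use the gather path only when the CPU reports AVX2. */
void extract_column(const float *data, int rows, int cols, int col,
                    float *out)
{
#ifdef HAVE_X86_GNU
    if (rows % 8 == 0 && __builtin_cpu_supports("avx2")) {
        extract_column_avx2(data, rows, cols, col, out);
        return;
    }
#endif
    extract_column_scalar(data, rows, cols, col, out);
}
```

With the question's layout, a call such as extract_column(data, x, y, 0, buffer) would fill buffer with column 0. Note that with a row stride of 4096 floats every gathered element sits in a different cache line, so the gather may not be much faster than the scalar loop; this sketch only shows the indexing, not a tuned kernel.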

c x86 simd matrix avx

Mar*_*ron

2019-01-25

Score: 2
Answers: 1
Views: 358