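Posts by Mar*_*tin

How to add an AVX2 vector horizontally, 3 by 3?

I have a __m256i vector containing 16 16-bit elements. I want to apply a horizontal addition of three adjacent elements to it. In scalar mode, I use the following code: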

unsigned short int temp[16];
__m256i sum_v;//has some values. 16 elements of 16-bit vector.   | 0 | x15 | x14 | x13 | ... | x3 | x2 | x1 |
_mm256_store_si256((__m256i *)&temp[0], sum_v);
output1 = (temp[0] + temp[1] + temp[2]);
output2 = (temp[3] + temp[4] + temp[5]);
output3 = (temp[6] + temp[7] + temp[8]);
output4 = (temp[9] + temp[10] + temp[11]);
output5 = (temp[12] + temp[13] + temp[14]); 
// Don't want the 15th element


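Because this part sits in the bottleneck of my program, I decided to vectorize it with AVX2. Ideally, I could add the elements as in the following pseudocode: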

sum_v                                     //| …
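The scalar code above can be vectorized in more than one way. What follows is only a minimal sketch, not the method from any answer: it uses SSE2 on two 128-bit halves (so it builds on x86-64 without extra flags), adds two shifted copies of the data so that position i holds x[i] + x[i+1] + x[i+2], and then extracts positions 0, 3, 6, 9 and 12. The function name `hsum3_sse2` and the `uint16_t` output type are assumptions; sums wrap around at 16 bits, unlike the scalar version, which adds in `int`.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: out[k] = x[3k] + x[3k+1] + x[3k+2] for k = 0..4.
 * x[15] is ignored. Sums wrap modulo 2^16. */
static void hsum3_sse2(const uint16_t x[16], uint16_t out[5])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)&x[0]);  /* x0..x7  */
    __m128i hi = _mm_loadu_si128((const __m128i *)&x[8]);  /* x8..x15 */

    /* lo1[i] = x[i+1], lo2[i] = x[i+2]; the top elements are pulled in from hi */
    __m128i lo1 = _mm_or_si128(_mm_srli_si128(lo, 2), _mm_slli_si128(hi, 14));
    __m128i lo2 = _mm_or_si128(_mm_srli_si128(lo, 4), _mm_slli_si128(hi, 12));
    /* hi1[i] = x[8+i+1], hi2[i] = x[8+i+2]; zeros shift in at the top */
    __m128i hi1 = _mm_srli_si128(hi, 2);
    __m128i hi2 = _mm_srli_si128(hi, 4);

    /* Sliding three-element sums */
    __m128i sum_lo = _mm_add_epi16(lo, _mm_add_epi16(lo1, lo2));
    __m128i sum_hi = _mm_add_epi16(hi, _mm_add_epi16(hi1, hi2));

    out[0] = (uint16_t)_mm_extract_epi16(sum_lo, 0);  /* x0  + x1  + x2  */
    out[1] = (uint16_t)_mm_extract_epi16(sum_lo, 3);  /* x3  + x4  + x5  */
    out[2] = (uint16_t)_mm_extract_epi16(sum_lo, 6);  /* x6  + x7  + x8  */
    out[3] = (uint16_t)_mm_extract_epi16(sum_hi, 1);  /* x9  + x10 + x11 */
    out[4] = (uint16_t)_mm_extract_epi16(sum_hi, 4);  /* x12 + x13 + x14 */
}
```

Whether this beats the scalar version depends on the surrounding code and should be measured; the five extracts at the end may cost as much as the additions they replace.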


x86 simd intrinsics avx2

Mar*_*tin

lucky-day

4
votes

2
answers

361
views

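What is the fastest way to explicitly multithread a SIMD operation?

Using intrinsics is a common way of SIMDizing code. For example, I can add eight integers with a single instruction via _mm256_add_epi32. It needs two _mm256_load_si256 before the addition and one _mm256_store_si256 after it, as follows: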

__m256i vec1 = _mm256_load_si256((__m256i *)&A[0]); // almost 5 cycles
__m256i vec2 = _mm256_load_si256((__m256i *)&B[0]); // almost 5 cycles
__m256i vec3 = _mm256_add_epi32( vec1 , vec2); // almost 1 cycle
_mm256_store_si256((__m256i *)&C[0], vec3); // almost 5


This executes the instructions on a single core of the CPU. My Core i7 has 8 cores (4 physical); I want to send the operations to all cores, like this:

int i_0, i_1, i_2, i_3, i_4, i_5, i_6, i_7 ; // These specify the values in memory
//core 0
__m256i vec1_0 = _mm256_load_si256((__m256i *)&A[i_0]); …


c x86 multithreading simd intrinsics

Mar*_*tin

2017 05-30

4
votes

1
answer

1577
views

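The number of Intel SSE4.2 instructions is exactly 313 (the sum of the counts in Intel's manual). I want the same number for AVX and AVX2 but cannot find any credible reference. I found one reference that says AVX has 292 instructions (page 1, Table 1), but it is mistaken, and it did not count SSSE3, which SSE4.2 includes. So how can I count the AVX/AVX2 instructions? (I thought of copying the Intel Intrinsics Guide into a text file and writing a program to process it, but I need an easier way.)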

x86 assembly avx avx2

Mar*_*tin

lucky-day

3
votes

1
answer

597
views

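Why does this SSE2 program (integers) generate movaps (float)?

The following loop transposes an integer matrix into another integer matrix. Interestingly, when I compile it, it generates movaps instructions to store the results into the output matrix. Why does gcc do this?

Data: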

int __attribute__(( aligned(16))) t[N][M]  
  , __attribute__(( aligned(16))) c_tra[N][M];


Loop:

for( i=0; i<N; i+=4){
    for(j=0; j<M; j+=4){

        row0 = _mm_load_si128((__m128i *)&t[i][j]);
        row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
        row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
        row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

        __t0 = _mm_unpacklo_epi32(row0, row1);
        __t1 = _mm_unpacklo_epi32(row2, row3);
        __t2 = _mm_unpackhi_epi32(row0, row1);
        __t3 = _mm_unpackhi_epi32(row2, row3);

        /* values back into I[0-3] */
        row0 = _mm_unpacklo_epi64(__t0, __t1);
        row1 = _mm_unpackhi_epi64(__t0, __t1);
        row2 = _mm_unpacklo_epi64(__t2, __t3);
        row3 = _mm_unpackhi_epi64(__t2, __t3);

        _mm_store_si128((__m128i *)&c_tra[j][i], row0);
        _mm_store_si128((__m128i …
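For reference, the unpack sequence in the loop above can be written as a self-contained 4x4 block transpose. This is a minimal sketch under stated assumptions: the helper name `transpose4x4_epi32` and its stride parameters are hypothetical, and it uses unaligned loads and stores so it works on any buffer. On the movaps question itself: an aligned 128-bit store writes the same bytes whether it is encoded as movdqa or movaps, and movaps has the shorter encoding, which is a commonly cited reason for compilers to prefer it.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: transpose one 4x4 block of 32-bit ints,
 * dst[c][r] = src[r][c]. Strides are given in elements, not bytes. */
static void transpose4x4_epi32(const int *src, int src_stride,
                               int *dst, int dst_stride)
{
    __m128i row0 = _mm_loadu_si128((const __m128i *)(src + 0 * src_stride)); /* a0 a1 a2 a3 */
    __m128i row1 = _mm_loadu_si128((const __m128i *)(src + 1 * src_stride)); /* b0 b1 b2 b3 */
    __m128i row2 = _mm_loadu_si128((const __m128i *)(src + 2 * src_stride)); /* c0 c1 c2 c3 */
    __m128i row3 = _mm_loadu_si128((const __m128i *)(src + 3 * src_stride)); /* d0 d1 d2 d3 */

    /* Interleave 32-bit elements of row pairs */
    __m128i t0 = _mm_unpacklo_epi32(row0, row1);  /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpacklo_epi32(row2, row3);  /* c0 d0 c1 d1 */
    __m128i t2 = _mm_unpackhi_epi32(row0, row1);  /* a2 b2 a3 b3 */
    __m128i t3 = _mm_unpackhi_epi32(row2, row3);  /* c2 d2 c3 d3 */

    /* Interleave 64-bit halves to form the transposed rows */
    _mm_storeu_si128((__m128i *)(dst + 0 * dst_stride), _mm_unpacklo_epi64(t0, t1)); /* a0 b0 c0 d0 */
    _mm_storeu_si128((__m128i *)(dst + 1 * dst_stride), _mm_unpackhi_epi64(t0, t1)); /* a1 b1 c1 d1 */
    _mm_storeu_si128((__m128i *)(dst + 2 * dst_stride), _mm_unpacklo_epi64(t2, t3)); /* a2 b2 c2 d2 */
    _mm_storeu_si128((__m128i *)(dst + 3 * dst_stride), _mm_unpackhi_epi64(t2, t3)); /* a3 b3 c3 d3 */
}
```

To transpose a whole N x M matrix, this helper would be called once per 4x4 tile, with the source tile at row i, column j and the destination tile at row j, column i; note that the destination then needs M rows and N columns.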


x86 assembly gcc sse simd

Mar*_*tin

lucky-day

3
votes

1
answer

609
views

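How to swap values with SIMD instructions?

I want to swap 16-bit values between two 256-bit vectors A and B. Ideally there would be a single intrinsic that does this. Unfortunately, I could not find one, and I think there is no instruction for this job. Instructions such as shuffle, permute and blend either preserve or destroy the values in the destination. What I am looking for is the following: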

vector A : |a0|a1|a2|a3|a4|a5|a6|a7||a8|a9|a10|a11|a12|a13|a14|a15|
Vector B : |b0|b1|b2|b3|b4|b5|b6|b7||b8|b9|b10|b11|b12|b13|b14|b15|
//After swapping
Vector A : |a0|a1|b2|a3|a4|b5|a6|a7||b8|a9|a10|b11|a12|a13|b14|a15|
Vector B : |b0|b1|a2|b3|b4|a5|b6|b7||a8|b9|b10|a11|b12|b13|a14|b15|


So the question is: given that there are many shuffle instructions, what is the fastest way to swap elements between two vectors?

I have implemented the following program:

vector A : |a0|a1|a2|a3|a4|a5|a6|a7||a8|a9|a10|a11|a12|a13|a14|a15|
Vector B : |b0|b1|b2|b3|b4|b5|b6|b7||b8|b9|b10|b11|b12|b13|b14|b15|
//After swapping
Vector A : |a0|a1|b2|a3|a4|b5|a6|a7||b8|a9|a10|b11|a12|a13|b14|a15|
Vector B : |b0|b1|a2|b3|b4|a5|b6|b7||a8|b9|b10|a11|b12|b13|a14|b15|


The output is here:

original     a : [0]= 1, [1]= 2, [2]= 3, [3]= 4, [4]= 5, [5]= 6, [6]= 7, [7]= 8,... [8]= 9, [9]=10, [10]=11, [11]=12, [12]=13, [13]=14, [14]=15, [15]=16 

original …
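The swap pattern in the diagrams above (elements 2, 5, 8, 11 and 14 exchanged) can be expressed with a mask and three XORs. The following is a minimal sketch and not the poster's program: it uses SSE2 on two 128-bit halves so that it builds on x86-64 without extra flags, and the function name `masked_swap_epi16` and the mask layout (0xFFFF where elements should be exchanged) are assumptions. With AVX2, the same logic maps onto _mm256_xor_si256 and _mm256_and_si256 over whole 256-bit registers.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: exchange a[i] and b[i] wherever mask[i] == 0xFFFF;
 * elements with mask[i] == 0 are left untouched. */
static void masked_swap_epi16(uint16_t a[16], uint16_t b[16],
                              const uint16_t mask[16])
{
    for (int half = 0; half < 2; half++) {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[8 * half]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[8 * half]);
        __m128i vm = _mm_loadu_si128((const __m128i *)&mask[8 * half]);

        /* diff is a ^ b in the selected elements and 0 elsewhere */
        __m128i diff = _mm_and_si128(_mm_xor_si128(va, vb), vm);

        /* XORing with diff turns a into b (and b into a) only where selected */
        _mm_storeu_si128((__m128i *)&a[8 * half], _mm_xor_si128(va, diff));
        _mm_storeu_si128((__m128i *)&b[8 * half], _mm_xor_si128(vb, diff));
    }
}
```

For the pattern in the question, the mask would hold 0xFFFF at indices 2, 5, 8, 11 and 14 and 0 everywhere else. Whether this is the fastest option is not something the sketch establishes; it only shows that the exchange needs no shuffle at all.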


x86 simd vectorization intrinsics avx2

Mar*_*tin

lucky-day

3
votes

1
answer

1025
views

What is the reason for different performance of the same implementation using icc, gcc and clang?

I have implemented a program for a[i]=a[i-1]+c and present it here. I use begin_rdtsc and end_rdtsc to read and store the rdtsc value in order to measure the speedup.

The program is as follows; it uses x86intrin.h:

#define MAX1 512
#define LEN (MAX1*MAX1)  // array size for time measurements
int __attribute__(( aligned(32))) a[LEN];

int main(){

    singleCore // It's a macro to assign the program to a single core of the processor
    int i, b, c;

    begin_rdtsc

    // b=1 and c=2 in this …


x86 assembly gcc simd icc

Mar*_*tin

2017 12-31

3
votes

1
answer

280
views

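How to make a C compiler convert all nested loops into a single loop

Suppose there are four nested loops with different loop counters and conditions. Is there a way to tell the compiler (icc, gcc and clang) to convert all of the loops into a single loop?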

N=128; M=128; P=3; Q=3; //All these variables are constant
for (n=0; n<N; n++){
    for(m=0; m<M; m++){
        temp=0;
        for(p=0; p<P; p++){ 
            for(q=0; q<Q; q++){
                temp += kernel[p][q] * input[n+p][m+q];
            }
        }
        output[n][m]=temp;
    }
}


To be transformed into: