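I'm trying to use 256-bit vectors (Intel intrinsics, AVX) to improve the performance of my code.
I have a 4th-generation Intel Core i7 (Haswell microarchitecture) that supports SSE1 through SSE4.2 as well as the AVX/AVX2 extensions.
This is the code snippet I'm trying to speed up: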
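Every kfacN in the snippet that follows equals kfac + N*factor. Because of that, all eight values kfac, kfac1, ..., kfac7 can sit in the lanes of one 256-bit vector and be computed independently, instead of through a chain of seven dependent additions.

This is a minimal sketch under a few assumptions. It assumes kfac and factor are single-precision floats, since their types are not shown in the question. It also assumes a GCC or Clang build. The function name kfac_ramp8 is hypothetical. The results can differ from the sequential chain in the last bit, because that chain rounds after every addition.

```c
#include <immintrin.h>

/* Hypothetical helper: writes kfac + i*factor for i = 0..7 into out[0..7],
   i.e. out[0] = kfac, out[1] = kfac1, ..., out[7] = kfac7.
   Assumes single-precision floats. The target attribute enables AVX for this
   function only, so it builds without -mavx (the CPU must still support AVX). */
__attribute__((target("avx")))
static void kfac_ramp8(float kfac, float factor, float out[8])
{
    /* Lane indices 0..7; _mm256_setr_ps lists lanes from lowest to highest. */
    const __m256 idx = _mm256_setr_ps(0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f);

    /* kfac + i*factor in every lane: no lane depends on another lane's result. */
    const __m256 v = _mm256_add_ps(_mm256_set1_ps(kfac),
                                   _mm256_mul_ps(_mm256_set1_ps(factor), idx));

    _mm256_storeu_ps(out, v); /* unaligned store, so out needs no 32-byte alignment */
}
```

To get the second row, call it again with k1fac and factor1. Haswell also supports FMA3, so _mm256_fmadd_ps could fuse the multiply and add, but that requires enabling FMA as well. For doubles, the __m256d counterparts (_mm256_set1_pd, _mm256_setr_pd) hold four lanes each, so two vectors would be needed.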
/* code snippet */
kfac1 = kfac + factor; /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac + factor1; /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1; …
I have a 128-bit vector of four floats that has already been computed, and I want to reorder it as follows:
Vector A before reordering
+---+---+---+---+
| a | b | c | d |
+---+---+---+---+
Vector A after reordering
+---+---+---+---+
| b | a | c | d |
+---+---+---+---+
As I said, the vector has already been produced by an earlier computation, so I can't just use _mm_set_ps()... Does anyone know how to do this?
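A single SSE shuffle of the vector with itself can swap the two low lanes without rebuilding it through _mm_set_ps(). This is a minimal sketch. It assumes the diagram lists the elements in memory order, so a is lane 0 (the float at the lowest address). If a is instead the highest lane, which is the order _mm_set_ps(a, b, c, d) produces, the control would be _MM_SHUFFLE(2, 3, 1, 0). The helper name swap_low_pair is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: {a, b, c, d} -> {b, a, c, d}, with a in lane 0.
   _MM_SHUFFLE(z, y, x, w) selects source lane w for result lane 0,
   x for lane 1, y for lane 2 and z for lane 3. */
static __m128 swap_low_pair(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));
}
```

On AVX hardware such as Haswell, _mm_permute_ps(v, _MM_SHUFFLE(3, 2, 0, 1)) (vpermilps) performs the same reordering with a single source operand.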
I want to dynamically create a 2D array like this:
+------+------+
| i | j |
+------+------+ // 2 cols and N rows (N unknown)
| 2 | 2048|
+------+------+
| 3 | 3072|
+------+------+
| 5 | 256|
+------+------+
| ... | ....|
+------+------+
Here is pseudocode showing how I would fill the array:
int N = 4096;
void foo(int N)
{
    for (int i = 0; i < N; i++)
    {
        int j = index_computation(i);
        if (j > i)
        {
            // allocate array row
            A[i][0] = i;
            A[i][1] = j;
        }
    }
}
I'm a bit confused about how to allocate it dynamically.
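One way to handle an unknown number of rows is to store the table as a pointer to 2-int rows (int (*)[2]) and grow it with realloc, doubling the capacity whenever it fills up. This is a minimal sketch, assuming rows are appended in the order they are found. PairTable and pair_table_push are hypothetical names. The body of index_computation is an invented stand-in, because the question does not show how j is computed.

```c
#include <stdlib.h>

/* Hypothetical growable table of (i, j) rows. */
typedef struct {
    int (*rows)[2];   /* rows[k][0] = i, rows[k][1] = j */
    size_t count;     /* rows in use */
    size_t capacity;  /* rows allocated */
} PairTable;

/* Appends one row, doubling the allocation when it is full.
   Returns 0 on success, -1 if realloc fails (the table is left unchanged). */
static int pair_table_push(PairTable *t, int i, int j)
{
    if (t->count == t->capacity) {
        size_t new_cap = t->capacity ? 2 * t->capacity : 16;
        int (*p)[2] = realloc(t->rows, new_cap * sizeof *p); /* realloc(NULL, n) acts like malloc */
        if (p == NULL)
            return -1;
        t->rows = p;
        t->capacity = new_cap;
    }
    t->rows[t->count][0] = i;
    t->rows[t->count][1] = j;
    t->count++;
    return 0;
}

/* Hypothetical stand-in for the question's index_computation(). */
static int index_computation(int i)
{
    return (i * 5) % 11;
}

/* Mirrors foo(): keeps only the rows where j > i. The caller frees t->rows. */
static int foo(int n, PairTable *t)
{
    for (int i = 0; i < n; i++) {
        int j = index_computation(i);
        if (j > i && pair_table_push(t, i, j) != 0)
            return -1;
    }
    return 0;
}
```

Start with PairTable t = {NULL, 0, 0}; and call free(t.rows) when you are done. Rows are only added when j > i, so the next row index is t->count rather than i, unlike the A[i][...] indexing in the pseudocode. If N is known before the loop, there can be at most N rows, so a single malloc(N * sizeof *t.rows) is a simpler alternative.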