I am trying to figure out how much I can fit into vector hardware. Taking for example an Intel AVX-512 capable piece of hardware, I can fit either 8 doubles (64-bit) or 16 singles (32-bit) into my vector. However, if I am running on a 64-bit machine, then my default pointer size is most likely 64 bits. So if I want to dereference a pointer (or just access an array using array syntax), that would require 64-bit integer operations. This seems to suggest that on a 64-bit machine the smallest partition I can have is a 64-bit data type.
Now consider the MWE below, where I would hope the compiler can see that I am only dealing with 32-bit objects (or smaller). Given that, I would expect that if the vector can be partitioned into 32-bit data types rather than 64-bit ones, the reduction/computation (assuming I am doing something more compute-intensive and less bandwidth-bound) would complete in half the time.
It seems to me that if I have vector registers and I want to do vector operations, then if I need n vector registers, each partitioned into m-bit data types, no part of the code I want vectorized can use a data type wider than m bits. (?)
MWE
Compiled with icc 18.0.0 using -mkl -O2 -qopenmp -qopt-report, where the optimization report confirms that the for loop is vectorized.
#include <stdlib.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    unsigned int a[N];
    for (unsigned int i = 0; i < N; i++) a[i] = i;
    unsigned int z[N];
    unsigned int *b = a;

    printf("Sizes (Bytes)\n");
    printf("Pointer = %zu\n", sizeof(b));       /* sizeof yields size_t: print with %zu */
    printf("Unsigned int = %zu\n", sizeof(*b));
    printf("Array = %zu\n\n", sizeof(a));

    unsigned int sum = 0;
    #pragma omp simd reduction(+:sum)
    for (unsigned int i = 0; i < N; i++)
    {
        z[i] = 4 * a[i];
        unsigned int squares = a[i] * a[i]; // Possibly some more complex sequence of operations.
        sum += squares;
    }

    for (unsigned int i = 0; i < N; i += N/4) printf("z[%u] = %u\n", i, z[i]);
    printf("\nsum = %u\n", sum);
    return 0;
}
The output on my machine is:
Sizes (Bytes)
Pointer = 8
Unsigned int = 4
Array = 4096
z[0] = 0
z[256] = 1024
z[512] = 2048
z[768] = 3072
sum = 357389824
> This seems to suggest that on a 64-bit machine the smallest partition I can have is a 64-bit data type.

This assumption is wrong.
To give an (awkward) analogy: the length of a postal address (in symbols) is unrelated to the size of the house. The width of a pointer is unrelated to the size of the data it references.
There is a lower bound on the smallest chunk of data that can be addressed on a given kind of hardware. It is called a byte (8 bits, a.k.a. an octet, on modern machines, but it could also be 10 or 6 bits, as on some ancient machines). In general there is no upper bound, however. In Intel 64, as one example, the XSAVE family of instructions references a memory block that is nearly 4 KB long, using the same 32/64-bit pointers.
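To make that concrete, here is a minimal C sketch (the function name is illustrative): a 64-bit pointer walks an array of 8-bit elements, and a vectorizer is free to process them at byte granularity, e.g. 64 per 512-bit register; the pointer width never enters into it.

#include <stddef.h>
#include <stdint.h>

/* On a 64-bit machine, p is a 64-bit pointer, yet the elements it
   addresses are 8 bits wide. The element width, not the pointer width,
   determines the SIMD lane size a compiler may choose. */
void increment_bytes(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] += 1;   /* candidate for byte-granular vectorization */
}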
> Taking for example an Intel AVX-512 capable piece of hardware I can fit either 8 doubles (64-bit) or 16 singles (32-bit) into my vector.
Or you can fit 32 half-floats (16-bit) or 64 bytes. Not sure if there are AVX-512 instructions operating on nibbles (4-bit chunks).
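As a sketch of those lane counts (assuming a compiler and CPU with AVX-512BW; the function names are illustrative), byte and 16-bit word additions each use the full 512-bit register:

#include <immintrin.h>

/* Requires AVX-512BW. The same 512-bit register holds 64 byte lanes
   or 32 word lanes. */
__m512i add_bytes(__m512i x, __m512i y) { return _mm512_add_epi8(x, y);  } /* 64 x 8-bit  */
__m512i add_words(__m512i x, __m512i y) { return _mm512_add_epi16(x, y); } /* 32 x 16-bit */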
> Is there a way to query the granularity of the vector partitioning that the compiler has used? (Avoiding digging through the resulting assembly.)
Again, the lower bound for the compiler's choice is dictated by the widths of the data types chosen in your program. If you use int, the granularity will be at least sizeof(int) bytes; if long, at least sizeof(long) bytes, and so on. It is unlikely that a type wider than necessary will be used, because that would cause semantic differences in the machine instructions that would have to be accounted for. For example, if a compiler, for unknown reasons, chose a SIMD vector partitioned into uint64_t chunks to operate on a vector of uint32_t chunks, it would have to hide the differences in overflow behavior, and that would incur a performance penalty.
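A scalar sketch of that overflow difference (plain C, nothing compiler-specific): uint32_t multiplication must wrap modulo 2^32, so 64-bit lanes would produce a different bit pattern unless extra truncation work were inserted.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0xFFFFFFFFu;
    uint32_t r32 = a * a;            /* wraps mod 2^32 -> 0x00000001         */
    uint64_t r64 = (uint64_t)a * a;  /* full product   -> 0xfffffffe00000001 */
    printf("32-bit lanes: %08" PRIx32 "\n64-bit lanes: %016" PRIx64 "\n", r32, r64);
    /* A vector partitioned into 64-bit lanes would compute r64 and would
       need an extra truncation step to match the required r32 semantics. */
    return 0;
}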
I do not know whether there are OpenMP pragmas to query for such information. It seems unlikely, given that the same binary may contain multiple code paths chosen dynamically at runtime (at program startup; the so-called dispatching used at least by the Intel compiler), so compile-time querying is out of the question, and I cannot see much use for runtime querying.
> On a 64-bit machine, if memory addresses are assumed to be 64-bit, how can a vector be partitioned into data types smaller than 64 bits?
Machine instructions simply interpret the same SIMD registers differently. Taking Intel 64 as an example, the same register can be treated as packed bytes, words, doublewords, quadwords, or single- and double-precision floats, among a wide variety of types (examples can be found in a recent Intel Software Developer's Manual).
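A sketch of that reinterpretation in intrinsics (AVX-512F, plus AVX-512BW for the byte form; the function name is illustrative): the same __m512i value is handed to instructions that treat it as 8-, 32-, or 64-bit lanes.

#include <immintrin.h>

/* One 512-bit value, three interpretations of its lanes. */
void interpretations(__m512i v, __m512i out[3])
{
    out[0] = _mm512_add_epi8(v, v);   /* VPADDB: 64 x  8-bit lanes (AVX-512BW) */
    out[1] = _mm512_add_epi32(v, v);  /* VPADDD: 16 x 32-bit lanes (AVX-512F)  */
    out[2] = _mm512_add_epi64(v, v);  /* VPADDQ:  8 x 64-bit lanes (AVX-512F)  */
}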