Strange uint32_t to float array conversion

Question

Sen*_*yai 9 c++ sse vectorization visual-studio

I have the following code snippet:

#include <cstdio>
#include <cstdint>

static const size_t ARR_SIZE = 129;

int main()
{
  uint32_t value = 2570980487;

  uint32_t arr[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
    arr[x] = value;

  float arr_dst[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
  {
    arr_dst[x] = static_cast<float>(arr[x]);
  }

  printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");

  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
  return 0;
}

If I compile it with MS Visual Studio 2015, the output is:

WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000

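So the last arr_dst element differs from the one before it, even though both were obtained by converting the same value, the one that fills the whole arr array! Is this a bug?

I noticed that if I modify the conversion loop as follows, I get the "OK" result: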

for (int x = 0; x < ARR_SIZE; ++x)
{
  if (x == 0)
    x = 0;
  arr_dst[x] = static_cast<float>(arr[x]);
}

So this is probably some issue with the vectorization optimization.

This behavior does not reproduce with gcc 4.8. Any ideas?

Answer 1

Joh*_*ger 5

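A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable in it, as is the case for most values of type uint32_t.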

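At the same time, the floating-point unit of an x86 or x86_64 processor uses a representation wider even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register in its extended-precision form. The program may then use that value directly from the register instead of reading it back from memory, which it has to do for the previous elements.

If the program performs the == comparison by promoting the narrower representation to the wider one rather than the other way around, then the two values might indeed compare unequal, because the round trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if they did compare unequal, the results of those conversions are likely to differ too.

I'm not up on MSVC++ compile options, but very likely there is one that would suppress this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.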

Answer 2

Pet*_*des 5

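Converting between unsigned and float is not simple on x86; there is no single instruction for it (until AVX512). A common technique is to convert as signed and then fix up the result. There are multiple ways of doing this. (See this Q&A for some manually vectorized methods with C intrinsics, not all of which give perfectly rounded results.)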

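MSVC vectorizes the first 128 elements with one strategy, and then uses a different strategy (one that wouldn't vectorize) for the last scalar element, which involves converting to double and then from double to float.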

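gcc and clang produce the 2570980608.0 result with both their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs). At this magnitude adjacent floats are 256 apart, so 0.5ulp is 128: gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, so we could check exhaustively). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but its scalar method is correctly rounded for this input.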

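IEEE math demands that + - * / and sqrt produce correctly rounded results (at most 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the rounding requirements are for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).

See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.

We can see what MSVC did in the asm linked by the OP (stripped down to only the interesting instructions and commented by hand):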

; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040                   ; size = 516
_arr$ = -520                        ; size = 516
_main   PROC                        ; COMDAT

  00013      mov     edx, 129
  00018      mov     eax, -1723986809   ; this is your unsigned 2570980487
  0001d      mov     ecx, edx
  00023      lea     edi, DWORD PTR _arr$[esp+1088]  ; edi=arr
  0002a      rep stosd             ; memset in chunks of 4B
  # arr[0..128] = 2570980487 at this point

  0002c      xor     ecx, ecx      ; i = 0
  # xmm2 = 0.0 in each element (i.e. all-zero)
  # xmm3 = __xmm@4f8000004f8000004f8000004f800000  (0x4f800000 = 4294967296.0f = 2^32, repeated in each of the 4 float elements)


  ####### The vectorized unsigned->float conversion strategy:
  $LL7@main:                                       ; do{
  00030      movups  xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088]  ; load 4 uint32_t
  00038      cvtdq2ps xmm1, xmm0                 ; SIGNED int to Single-precision float
  0003b      movaps  xmm0, xmm1
  0003e      cmpltps xmm0, xmm2                  ; xmm0 = (xmm0 < 0.0)
  00042      andps   xmm0, xmm3                  ; mask the magic constant
  00045      addps   xmm0, xmm1                  ; x += (x<0.0) ? magic_constant : 0.0f;
   # There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
  00048      movups  XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
  ; and repeat the same thing again, with addresses that are 16B higher (+1104)
  ; i.e. this loop is unrolled by two

  0006a      add     ecx, 8         ;  i+=8 (two vectors of 4 elements)
  0006d      cmp     ecx, 128
  00073      jb  SHORT $LL7@main    ; }while(i<128)

 #### End of vectorized loop
 # and then IDK what MSVC is smoking; both these values are known at compile time.  Is /Ogtp not full optimization?
 # I don't see a branch target that would let execution reach this code
 #  other than by falling out of the loop that ends with ecx=128
  00075      cmp     ecx, edx
  00077      jae     $LN21@main     ; if(i>=129): always false

  0007d      sub     edx, ecx       ; edx = 129-128 = 1

... some more ridiculous known-at-compile-time jumps later ...

 ######## The scalar unsigned->float conversion strategy for the last element
$LC15@main:
  00140      mov     eax, DWORD PTR _arr$[esp+ecx*4+1088]
  00147      movd    xmm0, eax
  # eax = xmm0[0] = arr[128]
  0014b      cvtdq2pd xmm0, xmm0        ; convert the last element TO DOUBLE
  0014f      shr     eax, 31            ; shift the sign bit down to bit 0, so eax = 0 or 1
     ; then eax indexes a 16B constant, selecting either 0.0 or 0x41f0000000000000 (the double 4294967296.0 = 2^32)
  00152      addsd   xmm0, QWORD PTR __xmm@41f00000000000000000000000000000[eax*8]
  0015b      cvtpd2ps xmm0, xmm0        ; double -> float
  0015f      movss   DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0  ; and store it

  00165      inc     ecx            ;   ++i;
  00166      cmp     ecx, 129       ; } while(i<129)
  0016c      jb  SHORT $LC15@main
  # Yes, this is a loop, which always runs exactly once for the last element

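By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or conversion after their respective loops. (clang actually fully unrolls everything unless you tell it not to.)

See the code on the Godbolt compiler explorer.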

gcc just converts the upper and lower 16-bit halves to float separately, and combines them with a multiply by 65536 and an add.

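Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer directly into the mantissas of two floats (with some tricks to set the exponents: bitwise boolean stuff and ADDPS), then adds the low and high halves together like gcc does.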

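Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64 bits and convert that to float as a signed int64_t. A signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.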

Answer 3

And*_*eas 2

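I investigated a PowerPC implementation (Freescale MPC7450), because IMHO those are far better documented than any voodoo Intel comes up with.

As it turns out, the floating-point unit (FPU) and the vector unit may round floating-point operations differently. The FPU can be configured to use one of four rounding modes: round to nearest (the default), truncate, toward positive infinity, and toward negative infinity. The vector unit, however, can only round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106 bits. The vector unit fulfills IEEE-754, but the documentation does not say much more.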

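Looking at your result, the conversion to 2570980608 is closer to the original integer, which suggests that the FPU has better internal precision than the vector unit, or that the two use different rounding modes.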
