Is there a way to force Visual Studio to generate aligned instructions from SSE intrinsics?

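Either disable optimization or make sure AVX is not enabled. Otherwise the compiler may fold a _mm_load_ps into a memory source operand, as in vaddps xmm0, [rax], which does not require alignment because it is the AVX (VEX-encoded) version. Leaving AVX disabled can be a problem if your code also uses AVX intrinsics in the same file, because clang requires you to enable the ISA extension for any intrinsic you want to use: unlike MSVC and ICC, it will not emit asm instructions for extensions that are not enabled, even when you use the intrinsics.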

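A debug build should work even with AVX enabled, especially if you do the _mm_load_ps or _mm256_load_ps in a separate statement into a separate variable, rather than writing v=_mm_add_ps(v, _mm_load_ps(ptr));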
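As a minimal sketch of that pattern (the function names accumulate4 and is_aligned_16 are hypothetical, and the assert is an extra illustration that the answer does not propose), each load gets its own statement and variable. In an unoptimized build the compiler will typically emit a standalone aligned load for each one, which is what faults on a misaligned pointer:

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE1: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper (not from the answer): nonzero if p is 16-byte aligned. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[0..3] += src[0..3]; both pointers must be 16-byte aligned. */
void accumulate4(float *dst, const float *src)
{
    /* Assumption for illustration: a software check in addition to the
       hardware fault that an aligned load instruction would raise. */
    assert(is_aligned_16(dst) && is_aligned_16(src));

    __m128 acc = _mm_load_ps(dst);   /* its own statement and variable */
    __m128 rhs = _mm_load_ps(src);   /* not nested inside _mm_add_ps */
    acc = _mm_add_ps(acc, rhs);
    _mm_store_ps(dst, acc);
}
```

With optimization and AVX both enabled, the compiler is still free to fold either load into the add, so this layout only helps in builds where loads stay as separate instructions.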

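With MSVC itself, for debugging purposes only (there is usually a big speed penalty for stores), you could replace normal loads/stores with NT (non-temporal) ones. Because they are special, the compiler will not fold NT loads into memory source operands of ALU instructions, so this may work even with AVX and with optimization enabled.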

// alignment_debug.h      (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
 #pragma message("using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations")
   // SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
   // On WB, it's just a slower MOVDQA, wasting an ALU uop.
 #define _mm_load_si128  _mm_stream_load_si128
 #define _mm_load_ps(ptr)  _mm_castsi128_ps(_mm_stream_load_si128((const __m128i*)(ptr)))
 #define _mm_load_pd(ptr)  _mm_castsi128_pd(_mm_stream_load_si128((const __m128i*)(ptr)))

  // SSE1/2 MOVNTPS / PD / MOVNTDQ  evict data from cache if it was hot, and bypass cache
 #define _mm_store_ps  _mm_stream_ps       // SSE1 movntps
 #define _mm_store_pd  _mm_stream_pd       // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
 #define _mm_store_si128 _mm_stream_si128  // SSE2 movntdq

// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions 
// edit welcome if anyone tests this and adds those versions
#endif
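A hedged usage sketch of the store half of that header follows. It is not from the answer: scale4 and demo are hypothetical names, and the single #define stands in for defining DEBUG_SIMD_ALIGNMENT and including alignment_debug.h. Only the SSE1 store mapping is shown, because the load mapping needs SSE4.1 to be enabled.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE1: _mm_load_ps, _mm_mul_ps, _mm_stream_ps, _mm_sfence */

/* Stand-in for the header's store mapping; normally this would come from
   defining DEBUG_SIMD_ALIGNMENT and including alignment_debug.h. */
#define _mm_store_ps _mm_stream_ps

/* p[0..3] *= k; p must be 16-byte aligned. */
void scale4(float *p, float k)
{
    __m128 v = _mm_load_ps(p);
    v = _mm_mul_ps(v, _mm_set1_ps(k));
    _mm_store_ps(p, v);   /* expands to _mm_stream_ps, i.e. movntps */
    _mm_sfence();         /* NT stores are weakly ordered; fence before later stores */
}

/* Usage sketch: an aligned buffer works. A misaligned pointer would be
   expected to fault in movntps, which is the purpose of the debug mapping. */
void demo(void)
{
    _Alignas(16) float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale4(buf, 2.0f);
    assert(buf[3] == 8.0f);
}
```

The calling code keeps using _mm_store_ps unchanged, so the debug mapping can be switched on and off without touching it.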


Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on "Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang".
