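I'm trying to port some SSE2 code (the FAST corner detector score computation) to ARM NEON instructions. At first glance the code is very simple, but for some reason the results differ. Sometimes the difference is very large, and sometimes the values are off by only 2 or 3. It would be great if someone could explain why this happens.
Here is the code.
Original SSE2: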
__m128i q0 = _mm_set1_epi16(-1000), q1 = _mm_set1_epi16(1000);
for( k = 0; k < 16; k += 8 )
{
    __m128i v0 = _mm_loadu_si128((__m128i*)(d+k+1));
    __m128i v1 = _mm_loadu_si128((__m128i*)(d+k+2));
    __m128i a = _mm_min_epi16(v0, v1);
    __m128i b = _mm_max_epi16(v0, v1);
    v0 = _mm_loadu_si128((__m128i*)(d+k+3));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+4));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+5));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+6));
    a = _mm_min_epi16(a, …

I decided to continue with the FAST corner optimization and got stuck on the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
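
To pin down what a NEON rewrite has to reproduce: _mm_movemask_epi8 gathers the most significant bit of each of the 16 bytes into a 16-bit integer, with byte 0 mapping to bit 0. Below is a minimal scalar sketch of that behaviour in plain C, meant only as a reference to check a NEON version against. The name movemask_u8_ref is hypothetical and is not part of any SSE or NEON API.

```c
#include <stdint.h>

/* Hypothetical reference helper (not a real intrinsic).
 * Scalar model of what _mm_movemask_epi8 computes:
 * bit i of the result is the most significant bit of bytes[i]. */
static uint16_t movemask_u8_ref(const uint8_t bytes[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        /* Take the top bit of byte i and place it at bit position i. */
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    }
    return mask;
}
```

On the NEON side, the contents of a uint8x16_t can be stored to a 16-byte array with vst1q_u8 and passed to this function, so a vectorised candidate can be compared against it lane by lane.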