SIMD instructions for floating point equality comparison (with NaN == NaN)

Cod*_*aos 11 floating-point x86 assembly x86-64 simd

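Which instructions would be used to compare two 128-bit vectors, each consisting of 4 * 32-bit floating-point values?

Is there an instruction that treats NaN values on both sides as equal? If not, how large is the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN)?

I have heard that ensuring reflexivity has a significant performance impact compared with IEEE semantics, where NaN does not equal itself, and I would like to know how large that impact would be.

I know that you typically want epsilon comparisons rather than exact equality when dealing with floating-point values. But this question is about exact equality comparisons, which you could use, for example, to eliminate duplicate values from a hash set.

Requirements

  • +0 and -0 must compare as equal.
  • NaN must compare equal to itself.
  • Different representations of NaN should compare equal, but that requirement may be sacrificed if the performance impact is too large.
  • The result should be a boolean: true if all four float elements are the same in both vectors, and false if at least one element differs. true is represented by the scalar integer 1 and false by 0.

Test cases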

(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0,  0, 0, 0) == (+0,  0, 0, 0) // equal despite different bitwise representations
(1,   0, 0, 0) == (1,   0, 0, 0)
(0,   0, 0, 0) != (1,   0, 0, 0) // at least one different element => not equal 
(1,   0, 0, 0) != (0,   0, 0, 0)
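As a baseline for checking any SIMD version, the requirements and test cases above can be written as a scalar reference. This is only a sketch: the names lane_equal and reflexive_equal4_ref are made up for illustration, and it implements the strictest variant, in which every NaN encoding equals every other.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical scalar reference for one lane.
   Any NaN equals any NaN; otherwise plain IEEE ==, so +0 == -0. */
static bool lane_equal(float a, float b)
{
    if (isnan(a) || isnan(b))
        return isnan(a) && isnan(b);
    return a == b;
}

/* Returns 1 if all four lanes are reflexively equal, else 0. */
int reflexive_equal4_ref(const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++) {
        if (!lane_equal(a[i], b[i]))
            return 0;
    }
    return 1;
}
```

A vectorized implementation can be validated by comparing its result with this function over the test cases listed above.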

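My idea for implementing this

I think it might be possible to combine two NotLessThan comparisons (CMPNLTPS?) using and to achieve the desired result. That is the assembly equivalent of AllTrue(!(x < y) and !(y < x)) or AllFalse((x < y) or (y < x)).

Background

The background for this question is Microsoft's plan to add a Vector type to .NET. I am arguing for a reflexive .Equals method and need a clearer picture of how large the performance impact of a reflexive equals would be compared with an IEEE equals. See "Should Vector<float>.Equals be reflexive or should it follow IEEE 754 semantics?" on Programmers for the long story.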

Pet*_*des 7

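Even with AVX VCMPPS available (with its greatly expanded choice of predicates), this is less efficient than an IEEE compare. You have to do at least two compares and combine the results. It's not too bad, though.

  • Different NaN encodings are not equal: effectively 2 extra insns (adding 2 uops). Without AVX: one extra movaps beyond that.

  • Different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: two extra movaps insns.

An IEEE compare-and-branch is 3 uops: cmpeqps / movmskps / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.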

With AVX512: bitwise-NaN is probably just one extra instruction, since normal vector compare and branch probably uses vcmpEQ_OQps/ktest same,same/jcc, so combining two different mask regs is free (just change the args to ktest). The only cost is the extra vpcmpeqd k2, xmm0,xmm1.

AVX512 any-NaN is just two extra instructions (2x VFPCLASSPS, with the 2nd one using the result of the first as a zeromask. See below). Again, then ktest with two different args to set flag.


My best idea so far: ieee_equal || bitwise_equal

If we give up on considering different NaN encodings equal to each other:

  • Bitwise equal catches two identical NaNs.
  • IEEE equal catches the +0 == -0 case.

There are no cases where either compare gives a false positive (since ieee_equal is false when either operand is NaN: we want just equal, not equal-or-unordered. AVX vcmpps provides both options, while SSE only provides a plain equal operation.)

We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero. (i.e. horizontal AND is hard, horizontal OR is easy (pmovmskb/test, or ptest). Taking the opposite sense of a comparison is free (jnz instead of jz).) This is the same trick that Paul R used.

; inputs in xmm0, xmm1
movaps    xmm2, xmm0    ; unneeded with 3-operand AVX instructions

cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.  -1:not ieee_equal.  predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd   xmm0, xmm1    ; -1:bitwise equal  0:otherwise

; xmm0   xmm2
;   0      0   -> equal   (ieee_equal only)
;   0     -1   -> unequal (neither)
;  -1      0   -> equal   (bitwise equal and ieee_equal)
;  -1     -1   -> equal   (bitwise equal only: only happens when both are NaN)

andnps    xmm0, xmm2    ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where  (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps  eax, xmm0
test      eax, eax      ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences
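For readers working in C rather than assembly, the same ieee_equal || bitwise_equal sequence can be sketched with SSE2 intrinsics. The function name is hypothetical, and the compiler is free to schedule or select instructions differently from the hand-written version above.

```c
#include <emmintrin.h>   /* SSE2 */

/* Returns 1 if every lane is IEEE-equal or bitwise-equal, else 0.
   Different NaN encodings are NOT treated as equal here. */
int equal_reflexive_bitwise_nan(__m128 a, __m128 b)
{
    /* cmpneqps: -1 where unequal or unordered, 0 where IEEE-equal */
    __m128 ieee_ne = _mm_cmpneq_ps(a, b);

    /* pcmpeqd: -1 where the 32-bit patterns are identical */
    __m128 bit_eq = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));

    /* andnps: NOT(bit_eq) AND ieee_ne, set only for real differences */
    __m128 diff = _mm_andnot_ps(bit_eq, ieee_ne);

    /* movmskps + test: no sign bit set means no lane differs */
    return _mm_movemask_ps(diff) == 0;
}
```

Returning the movemask value itself instead of the boolean would keep the position of the first differing lane available for a bit-scan.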

For double-precision, ...PS and pcmpeqQ will work the same.

If the not-equal code goes on to find out which element isn't equal, a bit-scan on the movmskps result will give you the position of the first difference.

With SSE4.1 PTEST you can replace andnps/movmskps/test-and-branch with:

ptest    xmm0, xmm2   ; CF =  0 == (NOT(xmm0) AND xmm2).
jc no_differences

I expect this is the first time most people have ever seen the CF result of PTEST be useful for anything. :)

It's still three uops on Intel and AMD CPUs (ptest is 2 uops plus 1 for jcc, vs. pandn + movmsk + fused test-and-branch), but fewer instructions. It is more efficient if you're going to setcc or cmovcc instead of jcc, since those can't macro-fuse with test.

That makes a total of 6 uops (5 with AVX) for a reflexive compare-and-branch, vs. 3 uops for an IEEE compare-and-branch. (cmpeqps/movmskps/test-and-branch.)

PTEST has a very high latency on AMD Bulldozer-family CPUs (14c on Steamroller). They have one cluster of vector execution units shared by two integer cores. (This is their alternative to hyperthreading.) This increases the time until a branch mispredict can be detected, or the latency of a data-dependency chain (cmovcc/setcc).

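PTEST also sets ZF when 0 == (xmm0 AND xmm2): it is set if no element was both bitwise_equal AND IEEE (neq or unordered). I.e. ZF is unset if any element was bitwise_equal while also being !ieee_equal. This can only happen when a pair of elements contains bitwise-equal NaNs (but it can happen while other elements are unequal).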

    movaps    xmm2, xmm0
    cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.
    pcmpeqd   xmm0, xmm1    ; -1:bitwise equal

    ptest    xmm0, xmm2
    jc   equal_reflexive   ; other cases

...

equal_reflexive:
    setnz  dl               ; set if at least one both-nan element

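There is no condition that tests CF=1 together with anything about ZF. ja tests CF=0 and ZF=0. It's unlikely that you'd want to test only that anyway, so putting a jnz in the jc branch target works fine. (If you did only want to test equal_reflexive AND at_least_one_nan, a different setup could probably set the flags appropriately.)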


Considering all NaNs equal, even when not bitwise equal:

This is the same idea as Paul R's answer, but with a bugfix (combine NaN check with IEEE check using AND rather than OR.)

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
movaps      xmm3, xmm1
cmpordps    xmm3, xmm3      ; find NaNs in B
orps        xmm2, xmm3      ; 0:A and B are both NaN.  -1:anything else

cmpneqps    xmm0, xmm1      ; 0:IEEE equal (and ordered).  -1:unequal or unordered
; xmm0 AND xmm2  is zero where elements are IEEE equal, or both NaN
; xmm0   xmm2 
;   0      0     -> equal   (ieee_equal and both NaN (impossible))
;   0     -1     -> equal   (ieee_equal)
;  -1      0     -> equal   (both NaN)
;  -1     -1     -> unequal (neither equality condition)

ptest    xmm0, xmm2        ; ZF=  0 == (xmm0 AND xmm2).  Set if no differences in any element
jz   equal_reflexive
; else at least one element was unequal

;     alternative to PTEST:  andps  xmm0, xmm2 / movmskps / test / jz

So in this case we don't need PTEST's CF result after all. We do when using PCMPEQD, because it doesn't have an inverse (the way cmpunordps has cmpordps).

9 fused-domain uops for Intel SnB-family CPUs. (7 with AVX: use non-destructive 3-operand instructions to avoid the movaps.) However, pre-Skylake SnB-family CPUs can only run cmpps on p1, so this bottlenecks on the FP-add unit if throughput is a concern. Skylake runs cmpps on p0/p1.

andps has a shorter encoding than pand, and Intel CPUs from Nehalem to Broadwell can only run it on port5. That may be desirable to prevent it from stealing a p0 or p1 cycle from surrounding FP code. Otherwise pandn is probably a better choice. On AMD BD-family, andnps runs in the ivec domain anyway, so you don't avoid the bypass delay between int and FP vectors (which you might otherwise expect to manage if you use movmskps instead of ptest, in this version that only uses cmpps, not pcmpeqd). Also note that instruction ordering is chosen for human readability here. Putting the FP compare(A,B) earlier, before the ANDPS, might help the CPU get started on that a cycle sooner.

If one operand is reused, it should be possible to reuse its self-NaN-finding result. The new operand still needs its self-NaN check, and a compare against the reused operand, so we only save one movaps/cmpps.

If the vectors are in memory, at least one of them needs to be loaded with a separate load insn. The other one can just be referenced twice from memory. This sucks if it's unaligned or the addressing mode can't micro-fuse, but could be useful. If one of the operands to vcmpps is a vector known to not have any NaNs (e.g. a zeroed register), vcmpunord_qps xmm2, xmm15, [rsi] will find NaNs in [rsi].

If we don't want to use PTEST, we can get the same result by using the opposite comparisons, but combining them with the opposite logical operator (AND vs. OR).

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpunordps  xmm2, xmm2      ; find NaNs in A (-1:NaN  0:anything else)
movaps      xmm3, xmm1
cmpunordps  xmm3, xmm3      ; find NaNs in B
andps       xmm2, xmm3      ; xmm2 = (-1:both NaN  0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal

cmpeqps     xmm0, xmm1      ; -1:ieee_equal  0:unordered or unequal
; xmm0   xmm2 
;  -1      0     -> equal   (ieee_equal)
;  -1     -1     -> equal   (ieee_equal and both NaN (impossible))
;   0      0     -> unequal (neither)
;   0     -1     -> equal   (both NaN)

orps        xmm0, xmm2      ; 0: unequal.  -1:reflexive_equal
movmskps    eax, xmm0
cmp         eax, 0xF        ; all four sign bits must be set (test/jnz would branch if any one element were equal)
je   equal_reflexive
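The non-inverted variant maps directly onto SSE intrinsics. The sketch below uses a hypothetical function name and assumes the goal is "all four lanes equal", so the movemask result is compared against 0xF rather than merely tested for non-zero.

```c
#include <emmintrin.h>   /* SSE2; also pulls in the SSE compare intrinsics */

/* Returns 1 if every lane is IEEE-equal or both lanes are NaN (any encoding). */
int equal_reflexive_any_nan(__m128 a, __m128 b)
{
    __m128 a_nan = _mm_cmpunord_ps(a, a);         /* -1 where a is NaN */
    __m128 b_nan = _mm_cmpunord_ps(b, b);         /* -1 where b is NaN */
    __m128 both_nan = _mm_and_ps(a_nan, b_nan);   /* -1 where both are NaN */

    __m128 ieee_eq = _mm_cmpeq_ps(a, b);          /* -1 where ordered and equal */
    __m128 eq = _mm_or_ps(ieee_eq, both_nan);     /* -1 where reflexively equal */

    /* Every lane must report equal, so all four sign bits must be set. */
    return _mm_movemask_ps(eq) == 0xF;
}
```

Two vectors that match in one lane but differ in another must return 0, which is why a plain non-zero test on the mask is not sufficient in this orientation.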

Other ideas: unfinished, non-viable, broken, or worse-than-the-above

The all-ones result of a true comparison is an encoding of NaN. (Try it out.) Perhaps we can avoid using POR or PAND to combine results from cmpps on each operand separately?

; inputs in A:xmm0 B:xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed.  (This could be useful if ORPS didn't exist)

; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps  xmm2, xmm1
; A:NaN B:0   ->  0   unord 0   -> false
; A:0   B:NaN ->  NaN unord NaN -> true

; A:0   B:0   ->  NaN unord 0   -> true
; A:NaN B:NaN ->  0   unord NaN -> true

; Desired:   0 where A and B are both NaN.

cmpordps xmm2, xmm1 just flips the final result for each case, with the "odd-man-out" still on the 1st row.

We can only get the result we want (true iff A and B are both NaN) if both inputs are inverted (NaN -> non-NaN and vice versa). This means we could use this idea for cmpordps as a replacement for pand after doing cmpordps self,self on both A and B. This isn't useful: even if we have AVX but not AVX2, we can use vandps and vandnps, and both vmovmskps and vptest are available with plain AVX. Bitwise booleans are only single-cycle latency, and don't tie up the vector-FP-add execution port(s) which is already a bottleneck for this code.


VFIXUPIMMPS

I spent a while with the manual grokking its operation.

It can modify a destination element if a source element is NaN, but that can't be conditional on anything about the dest element.

I was hoping I could think of a way to vcmpneqps and then fixup that result, once with each source operand (to elide the boolean instructions that combine the results of 3 vcmpps instructions). I'm now fairly sure that's impossible, because knowing that one operand is NaN isn't enough by itself to make a change to the IEEE_equal(A,B) result.

I think the only way we could use vfixupimmps is for detecting NaNs in each source operand separately, like vcmpunord_qps but worse. Or as a really stupid replacement for andps, detecting either 0 or all-ones(NaN) in the mask results of previous compares.


AVX512 mask registers

Using AVX512 mask registers could help combine the results of compares. Most AVX512 compare instructions put the result into a mask register instead of a mask vector in a vector reg, so we actually have to do things this way if we want to operate in 512b chunks.

VFPCLASSPS k2 {k1}, xmm2, imm8 writes to a mask register, optionally masked by a different mask register. By setting only the QNaN and SNaN bits of the imm8, we can get a mask of where there are NaNs in a vector. Setting all the other bits gets close to the inverse, but not exactly: none of the imm8 categories matches positive normal values.

By using the mask from A as a zero-mask for the vfpclassps on B, we can find the both-NaN positions with only 2 instructions, instead of the usual cmp/cmp/combine. So we save an or or andn instruction. Incidentally, I wonder why there's no OR-NOT operation. Probably it comes up even less often than AND-NOT, or they just didn't want porn in the instruction set.

Neither yasm nor nasm can assemble this, so I'm not even sure if I have the syntax correct!

; I think this works

;  0x81 = CLASS_QNAN|CLASS_SNAN (first and last bits of the imm8)
VFPCLASSPS    k1,     zmm0, 0x81 ; k1 = 1:NaN in A.   0:non-NaN
VFPCLASSPS    k2{k1}, zmm1, 0x81 ; k2 = 1:NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
;; so k2 is like the bitwise-equal result from pcmpeqd: it's an override for ieee_equal

vcmpNEQ_UQps  k3, zmm0, zmm1
;; k3= 0 only where IEEE equal (because of cmpneqps normal operation)

;  k2   k3   ; same logic table as the pcmpeqd bitwise-NaN version
;  0    0    ->  equal   (ieee equal)
;  0    1    ->  unequal (neither)
;  1    0    ->  equal   (ieee equal and both-NaN (impossible))
;  1    1    ->  equal   (both NaN)

;  not(k2) AND k3 is true only when the element is unequal (bitwise and ieee)

KTESTW        k2, k3    ; same as PTEST: set CF from 0 == (NOT(k2) AND k3)
jc .reflexive_equal

We could reuse the same mask register as both zeromask and destination for the 2nd vfpclassps insn, but I used different registers in case I wanted to distinguish between them in a comment. This code needs a minimum of two mask registers, but no extra vector registers. We could also use k0 instead of k3 as the destination for vcmpps, since we don't need to use it as a predicate, only as a dest and src. (k0 is the register that can't be used as a predicate, because that encoding instead means "no masking".)

I'm not sure we could create a single mask with the reflexive_equal result for each element, without a k... instruction to combine two masks at some point (e.g. kandnw instead of ktestw). Masks only work as zero-masks, not one-masks that can force a result to one, so combining the vfpclassps results only works as an AND. So I think we're stuck with 1-means-both-NaN, which is the wrong sense for using it as a zeromask with vcmpps. Doing vcmpps first, and then using the mask register as destination and predicate for vfpclassps, doesn't help either. Merge-masking instead of zero-masking would do the trick, but isn't available when writing to a mask register.

;;; Demonstrate that it's hard (probably impossible) to avoid using any k... instructions
vcmpneq_uqps  k1,    zmm0, zmm1   ; 0:ieee equal   1:unequal or unordered

;; ~0x81 (0x7E) selects the non-NaN categories.  Caveat: positive normal values match no VFPCLASS category.
vfpclassps    k2{k1}, zmm0, ~0x81  ; 0:ieee equal or A is NaN.  1:unequal
vfpclassps    k2{k2}, zmm1, ~0x81  ; 0:ieee equal | A is NaN | B is NaN.  1:unequal
;; This is just a slow way to do vcmpneq_Oqps: ordered and unequal.

vfpclassps    k3{k1}, zmm0, 0x81   ; 0:ieee equal or A is not NaN.  1:unequal and A is NaN
vfpclassps    k3{k3}, zmm1, 0x81   ; 0:ieee equal | A is not NaN | B is not NaN.  1:unequal & A is NaN & B is NaN
;; nope, mixes the conditions the wrong way.
;; The bits that remain set don't have any information from vcmpneqps left: both-NaN is always ieee-unequal.

If ktest ends up being 2 uops like ptest, and can't macro-fuse, then kmov eax, k2/test-and-branch will probably be cheaper than ktest k1,k2/jcc. Hopefully it will only be one uop, since mask registers are more like integer registers, and can be designed from the start to be internally "close" to the flags. ptest was only added in SSE4.1, after many generations of designs with no interaction between vectors and EFLAGS.

kmov does set you up for popcnt, bsf or bsr, though. (bsf/jcc doesn't macro-fuse, so in a search loop you're probably still going to want to test/jcc and only bsf when a non-zero is found. The extra byte to encode tzcnt doesn't buy you anything unless you're doing something branchless, because bsf still sets ZF on a zero input, even though the dest register is undefined. lzcnt gives 31 - bsr, though, so it can be useful even when you know the input is non-zero.)

We can also use vcmpEQps and combine our results differently:

VFPCLASSPS      k1,     zmm0, 0x81 ; k1 = set where there are NaNs in A
VFPCLASSPS      k2{k1}, zmm1, 0x81 ; k2 = set where there are NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
vcmpEQ_OQps     k3, zmm0, zmm1
;; k3= 1 only where IEEE equal and ordered (cmpeqps normal operation)

;  k3   k2
;  1    0    ->  equal   (ieee equal)
;  1    1    ->  equal   (ieee equal and both-NaN (impossible))
;  0    0    ->  unequal (neither)
;  0    1    ->  equal   (both NaN)

KORTESTW        k3, k2  ; CF = set iff k3|k2 is all-ones.
jc .reflexive_equal

This way only works when there's a size of kortest that exactly matches the number of elements in our vectors. e.g. a 256b vector of double-precision elements only has 4 elements, but kortestb still sets CF according to the low 8 bits of the input mask registers.
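The size mismatch is easy to see in a scalar model of the flag computation. The sketch below is not AVX512 code: kortestb_cf models only the carry-flag rule of KORTESTB, and all_lanes_set is a hypothetical fallback that assumes the combined mask has been moved to an integer register (e.g. with kmov) and that mask bits above the element count are zero.

```c
#include <stdint.h>

/* Scalar model of KORTESTB's carry flag: CF = 1 iff the low 8 bits
   of (k1 | k2) are all ones. */
int kortestb_cf(uint8_t k1, uint8_t k2)
{
    return (uint8_t)(k1 | k2) == 0xFF;
}

/* Hypothetical fallback for vectors with fewer than 8 elements:
   compare only the bits that correspond to real lanes. */
int all_lanes_set(uint8_t k1, uint8_t k2, int nelem)
{
    uint8_t live = (uint8_t)((1u << nelem) - 1u);
    return ((k1 | k2) & live) == live;
}
```

With four double-precision lanes that are all equal, the masks OR to 0x0F, so kortestb_cf reports 0 even though every lane matched, while all_lanes_set with nelem = 4 reports 1.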


Using only integer ops

Other than NaN, +/-0 is the only time when IEEE_equal is different from bitwise_equal. (Unless I'm missing something. Double-check this assumption before using!) +0 and -0 have all their bits zero, except that -0 has the sign bit set (the MSB).

If we ignore different NaN encodings, then bitwise_equal is the result we want, except in the +/- 0 case. A OR B will be 0 everywhere except the sign bit iff A and B are +/- 0. A left-shift by one makes it all-zero or not-all-zero depending on whether or not we need to override the bitwise-equal test.

This uses one more instruction than cmpneqps, because we're emulating the functionality we need from it with por/paddD. (or pslld by one, but that's one byte longer. It does run on a different port than pcmpeq, but you need to consider the port distribution of the surrounding code to factor that into the decision.)

This algorithm might be useful on different SIMD architectures that don't provide the same vector FP tests for detecting NaN.

;inputs in xmm0:A  xmm1:B
movaps    xmm2, xmm0
pcmpeqd   xmm2, xmm1     ; xmm2=bitwise_equal.  (0:unequal -1:equal)

por       xmm0, xmm1
paddD     xmm0, xmm0     ; left-shift by 1 (one byte shorter than pslld xmm0, 1, and can run on more ports).

; xmm0=all-zero only in the +/- 0 case (where A and B are IEEE equal)

; xmm2     xmm0          desired result (0 means "no difference found")
;  -1       0        ->      0          ; bitwise equal and +/-0 equal
;  -1     non-zero   ->      0          ; just bitwise equal
;   0       0        ->      0          ; just +/-0 equal
;   0     non-zero   ->      non-zero   ; neither

ptest     xmm2, xmm0         ; CF = ( (not(xmm2) AND xmm0) == 0)
jc  reflexive_equal
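An intrinsics sketch of the integer-only idea follows. To stay within SSE2 it does not use PTEST. Instead it performs the ANDN explicitly and then checks all 128 bits with pcmpeqb/pmovmskb, which costs extra instructions compared with the assembly above. The function name is hypothetical, and the inputs are the raw float bit patterns.

```c
#include <emmintrin.h>   /* SSE2 */

/* a and b hold the raw bit patterns of four floats each.
   Returns 1 if every lane is bitwise-equal or both lanes are +/-0. */
int equal_reflexive_int_only(__m128i a, __m128i b)
{
    __m128i bit_eq = _mm_cmpeq_epi32(a, b);       /* pcmpeqd: -1 where identical */

    __m128i both = _mm_or_si128(a, b);            /* por */
    __m128i shifted = _mm_add_epi32(both, both);  /* paddd: shift out the sign bit */
    /* shifted is zero only where both lanes are +0 or -0 */

    /* NOT(bit_eq) AND shifted: non-zero only for a real difference */
    __m128i diff = _mm_andnot_si128(bit_eq, shifted);

    /* SSE2 stand-in for PTEST's whole-register zero check */
    __m128i is_zero = _mm_cmpeq_epi8(diff, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
}
```

A float vector can be passed in with _mm_castps_si128, which generates no instruction.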

The latency is lower than the cmpneqps version above, by one or two cycles.

We're really taking full advantage of PTEST here: Using its ANDN between two different operands, and using its compare-against-zero of the whole thing. We can't replace it with pandn / movmskps because we need to check all the bits, not just the sign bit of each element.

I haven't actually tested this, so it might be wrong even if my conclusion is right that +/-0 is the only case where IEEE_equal differs from bitwise_equal (other than NaNs).


Handling non-bitwise-identical NaNs with integer-only ops is probably not worth it. The encoding is so similar to +/-Inf that I can't think of any simple checks that wouldn't take several instructions. Inf has all the exponent bits set, and an all-zero mantissa. NaN has all the exponent bits set, with a non-zero mantissa aka significand (so there are 23 bits of payload). The MSB of the mantissa is interpreted as an is_quiet flag to distinguish signalling/quiet NaNs. Also see Intel manual vol1, table 4-3 (Floating-Point Number and NaN Encodings).

If it wasn't for -Inf using the top-9-bits-set encoding, we could check for NaN with an unsigned compare for A > 0x7f800000. (0x7f800000 is single-precision +Inf). However, note that pcmpgtd/pcmpgtq are signed integer compares. AVX512F VPCMPUD is an unsigned compare (dest = a mask register).
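One way around both obstacles, at the cost of an extra AND, is to clear the sign bit first. After that, -Inf and +Inf share one encoding and every value is non-negative, so even a signed compare gives the right answer. The sketch below is an illustration of that idea under those assumptions, not something taken from the answer, and both function names are hypothetical.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Scalar form: with the sign bit cleared, NaN is anything above +Inf. */
int is_nan_bits(uint32_t u)
{
    return (u & 0x7fffffffu) > 0x7f800000u;
}

/* Vector form: returns a 4-bit mask, bit i set where lane i encodes a NaN. */
int nan_lanes_epi32(__m128i v)
{
    __m128i mag = _mm_and_si128(v, _mm_set1_epi32(0x7fffffff));
    /* pcmpgtd is a signed compare, which is safe once the sign bit is clear */
    __m128i gt = _mm_cmpgt_epi32(mag, _mm_set1_epi32(0x7f800000));
    return _mm_movemask_ps(_mm_castsi128_ps(gt));
}
```

Building a full reflexive compare on top of this still needs the +/-0 handling from the integer-only version, so the instruction count grows quickly.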


The OP's idea: !(a<b) && !(b<a)

The OP's suggestion of !(a<b) && !(b<a) can't work, and neither can any variation of it. You can't tell the difference between one NaN and two NaNs just from two compares with reversed operands. Even mixing predicates can't help: No VCMPPS predicate differentiates one operand being NaN from both operands being NaN, or depends on whether it's the first or second operand that's NaN. Thus, it's impossible for a combination of them to have that information.

Paul R's solution of comparing a vector with itself does let us detect where there are NaNs and handle them "manually". No combination of results from VCMPPS between the two operands is sufficient, but using operands other than A and B does help. (Either a known-non-NaN vector or same operand twice).
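The failure of the two-compare idea shows up already in scalar code. In this small sketch (function names made up for illustration), a single NaN operand is enough to make the expression report equality:

```c
#include <math.h>

/* The OP's proposed test, applied to one pair of scalars. */
int op_idea(float a, float b)
{
    return !(a < b) && !(b < a);
}

/* Every ordered compare involving NaN is false, so both negations are
   true and NaN appears "equal" to an ordinary number. */
int op_idea_false_positive(void)
{
    return op_idea(NAN, 1.0f);   /* yields 1, although NaN != 1.0 is wanted */
}
```

The same expression gives 1 for (NaN, NaN), so it cannot separate the one-NaN case from the both-NaN case.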


Without the inversion, the bitwise-NaN code finds when at least one element is equal. (There's no inverse for pcmpeqd, so we can't use different logical operators and still get a test for all-equal):

; inputs in xmm0, xmm1
movaps   xmm2, xmm0
cmpeqps  xmm2, xmm1    ; -1:ieee_equal.  EQ_OQ predicate in the expanded notation for VEX encoding
pcmpeqd  xmm0, xmm1    ; -1:bitwise equal
orps     xmm0, xmm2
; xmm0 = -1:(where an element is bitwise or ieee equal)   0:elsewhere

movmskps eax, xmm0
test     eax, eax
jnz at_least_one_equal
; else  all different

PTEST isn't useful this way, since combining with OR is the only useful thing.


// UNFINISHED start of an idea
bitdiff = _mm_xor_si128(A, B);
signbitdiff = _mm_srai_epi32(bitdiff, 31);   // broadcast the diff in sign bit to the whole vector
signbitdiff = _mm_srli_epi32(bitdiff, 1);    // zero the sign bit
something = _mm_and_si128(bitdiff, signbitdiff);

  • Wait, do you ever sleep? Also, do you think we could use the cool VFIXUPIMMPS trick here? (OK, "use" as in "think about it now", thanks Intel!) (2 upvotes)

Pau*_*l R 3

Here is one possible solution. It is not very efficient, however, requiring 6 instructions:

__m128 v0, v1; // float vectors

__m128 v0nan = _mm_cmpeq_ps(v0, v0);                   // test v0 for NaNs (0 where NaN, -1 elsewhere)
__m128 v1nan = _mm_cmpeq_ps(v1, v1);                   // test v1 for NaNs
__m128 vnan = _mm_or_ps(v0nan, v1nan);                 // combine: 0 only where both are NaN
__m128 vcmp = _mm_cmpneq_ps(v0, v1);                   // compare floats (-1 where unequal or unordered)
vcmp = _mm_and_ps(vcmp, vnan);                         // combine NaN test
__m128i vcmpi = _mm_castps_si128(vcmp);                // cast only, no instruction generated
bool cmp = _mm_testz_si128(vcmpi, vcmpi);              // return true if all equal (SSE4.1 PTEST)

Note that all the logic above is inverted, which may make the code a little hard to follow (ORs are effectively ANDs, and vice versa).