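I'm curious how many ways there are to set a register to zero in x86 assembly using a single instruction. Someone told me he had managed to find at least 10 ways.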
The ones I can think of are:
xor ax,ax
mov ax, 0
and ax, 0
Answer by GJ. (score 13):
Under IA32 there are many ways to move 0 into ax...
lea eax, [0]
mov eax, 0FFFF0000h //Any constant from 0..0FFFFh << 16 leaves ax = 0
shr ax, 16 //Any count from 16..31
shl eax, 16 //Any count from 16..31: the low 16 bits (ax) end up 0
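To see why the mov and shl lines clear ax, remember that ax is just the low 16 bits of eax. Below is a minimal Python model of the register arithmetic. It is not real x86 code, the helper names `ax_of` and `shl32` are hypothetical, and it assumes 32-bit operand size:

```python
MASK32 = 0xFFFFFFFF


def ax_of(eax: int) -> int:
    """ax aliases the low 16 bits of eax."""
    return eax & 0xFFFF


def shl32(eax: int, count: int) -> int:
    """32-bit SHL: the count is masked to 5 bits and the result truncated to 32 bits."""
    return (eax << (count & 31)) & MASK32


# mov eax, k: any k whose low 16 bits are zero leaves ax == 0
for high in (0x0000, 0x0001, 0x1234, 0xFFFF):
    assert ax_of(high << 16) == 0

# shl eax, n for n in 16..31 moves every original bit of ax into the high half
garbage = 0xDEADBEEF
for n in range(16, 32):
    assert ax_of(shl32(garbage, n)) == 0

print(hex(shl32(garbage, 16)))  # 0xbeef0000
```

The upper half of eax generally survives, so these forms zero ax only, not eax.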
Maybe the strangest one... :)
@movzx:
movzx eax, byte ptr[@movzx + 6] //The last byte of this 7-byte instruction (the high byte of its disp32 address) is 0 when the code is below 16 MiB
And...
@movzx:
movzx ax, byte ptr[@movzx + 7]
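The self-referencing movzx trick depends on the instruction encoding. This minimal Python sketch assumes 32-bit mode and the absolute `[disp32]` addressing form (ModRM byte 05h). Under those assumptions, `movzx eax, byte ptr [addr]` is `0F B6 05` followed by a 4-byte little-endian address, 7 bytes in total. Offset 6 is therefore the most significant byte of the address, and it is 0 roughly whenever the instruction sits below the 16 MiB mark. The `66h` operand-size prefix for `movzx ax, ...` makes the instruction 8 bytes long, hence offset 7. The function names and example addresses are hypothetical:

```python
import struct


def encode_movzx_abs(addr: int, operand_size_prefix: bool = False) -> bytes:
    """Encode `movzx eax, byte ptr [addr]` (or `movzx ax, ...` with the 66h prefix)
    in 32-bit mode: opcode 0F B6, ModRM 05h = [disp32] with no base/index."""
    prefix = b"\x66" if operand_size_prefix else b""
    return prefix + b"\x0f\xb6\x05" + struct.pack("<I", addr)


def self_loaded_byte(label: int, operand_size_prefix: bool = False) -> int:
    """Byte loaded by the instruction at `label` when it points at its own last byte."""
    length = 8 if operand_size_prefix else 7
    code = encode_movzx_abs(label + length - 1, operand_size_prefix)
    assert len(code) == length
    return code[-1]


print(self_loaded_byte(0x00401000))        # 0: high byte of 0x00401006
print(self_loaded_byte(0x00401000, True))  # 0: high byte of 0x00401007
print(hex(self_loaded_byte(0x12345678)))   # 0x12: the trick fails above 16 MiB
```

The 16-bit-mode variant below uses a different (disp16) encoding, which this sketch does not model.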
Edit:
For 16-bit x86 CPU mode, untested...:
lea ax, [0]
And...
@movzx:
movzx ax, byte ptr cs:[@movzx + 7] //Check if 7 is right offset
The cs: prefix can only be omitted if the ds segment register equals the cs segment register.
See this answer for the best way to zero a register: xor eax,eax (performance advantages and a smaller encoding).
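I'm only going to consider ways a single instruction can zero a register. If you allow loading a zero from memory there are far too many ways, so I mostly exclude instructions that load from memory.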
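I found 10 different single instructions that zero a 32-bit register (and therefore the full 64-bit register in long mode), with no other preconditions and no loads from any other memory. This doesn't count different encodings of the same insn, or the different forms of mov. If you count loading from memory that's known to hold a zero, or from segment registers or anything else, there are a huge number of ways. There are also many ways to zero vector registers.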
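For most of these, the eax and rax versions are separate encodings of the same functionality. Both zero the full 64-bit register: the eax form implicitly zeroes the upper half, and the rax form explicitly writes the full register using a REX.W prefix.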
# Works on any reg unless noted, usually of any size. eax/ax/al as placeholders
and eax, 0 ; three encodings: imm8, imm32, and eax-only imm32
andn eax, eax,eax ; BMI1 instruction set: dest = ~s1 & s2
imul eax, any,0 ; eax = something * 0. two encodings: imm8, imm32
lea eax, [0] ; absolute encoding (disp32 with no base or index). Use [abs 0] in NASM if you used DEFAULT REL
lea eax, [rel 0] ; YASM supports this, but NASM doesn't: use a RIP-relative encoding to address a specific absolute address, making position-dependent code
mov eax, 0 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 0 ; 7 bytes: REX mov r/m64, sign-extended-imm32. NASM optimizes mov rax,0 to the 5B version, but dword or strict dword stops it for some reason
mov rax, strict qword 0 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T. normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.
sub eax, eax ; recognized as a zeroing idiom on some but maybe not all CPUs
xor eax, eax ; Preferred idiom: recognized on all CPUs
@movzx:
movzx eax, byte ptr[@movzx + 6] //Because the last byte of this instruction is 0. neat hack from GJ.'s answer
.l: loop .l ; clears e/rcx... eventually. from I. J. Kennedy's answer. To operate on only ECX, use an address-size prefix.
; rep lodsb ; not counted because it's not safe (potential segfaults), but also zeros ecx
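As a sanity check on the unconditional entries above, here is a minimal Python model of their results. It models values only, not performance or flags. It assumes the long-mode rule that writing a 32-bit register zero-extends into the full 64-bit register, and the helper names are hypothetical:

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write32(result: int) -> int:
    """A 32-bit destination write zero-extends into the 64-bit register."""
    return result & MASK32


def zeroing_results(rax: int, other: int) -> dict:
    """Model the new RAX value after each unconditional zeroing instruction."""
    eax = rax & MASK32
    return {
        "xor eax, eax": write32(eax ^ eax),
        "sub eax, eax": write32(eax - eax),
        "and eax, 0": write32(eax & 0),
        "andn eax, eax, eax": write32(~eax & eax),  # dest = ~s1 & s2
        "imul eax, other, 0": write32((other & MASK32) * 0),
        "mov eax, 0": write32(0),
    }


def run_loop(rcx: int) -> tuple:
    """`.l: loop .l`: decrement RCX, branch back while it is nonzero.
    Returns (final RCX, iterations). Starting from 0 it would wrap and
    take 2**64 iterations, so only call this with small nonzero values."""
    iterations = 0
    while True:
        rcx = (rcx - 1) & MASK64
        iterations += 1
        if rcx == 0:
            return rcx, iterations


print(zeroing_results(0xFFFFFFFF12345678, 7))
print(run_loop(5))  # (0, 5)
```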
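"Shifting all the bits out one end" isn't possible for regular-size GP registers, only for partial registers. shl and shr shift counts are masked: count &= 31; (equivalent to count %= 32), or count &= 63 for 64-bit operand size. (But the 286 and earlier were 16-bit only, so ax was a "full" register there. The shr r/m16, imm8 immediate-count form of the instruction was added with the 286, so there were CPUs where a single shift could zero an entire integer register.)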
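Here is a minimal Python model of the count masking described above. The helper names are hypothetical, and the model covers results only, not flags. It shows why `shl eax, 32` is a no-op while `shl ax, 16` clears ax:

```python
def x86_shl(value: int, count: int, bits: int) -> int:
    """SHL on an operand `bits` wide (8, 16, 32 or 64).
    The count is masked to 5 bits, or to 6 bits for 64-bit operands."""
    count &= 63 if bits == 64 else 31
    return (value << count) & ((1 << bits) - 1)


def x86_shr(value: int, count: int, bits: int) -> int:
    """Logical SHR with the same count masking."""
    count &= 63 if bits == 64 else 31
    return (value & ((1 << bits) - 1)) >> count


print(hex(x86_shl(0xDEADBEEF, 32, 32)))  # 0xdeadbeef: count 32 & 31 == 0
print(x86_shl(0xBEEF, 16, 16))           # 0: count 16 survives the mask
print(x86_shr(0xAB, 16, 8))              # 0: same for an 8-bit shift
```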
Also note that shift counts for vectors saturate instead of wrapping.
# Zeroing methods that only work on 16bit or 8bit regs:
shl ax, 16 ; shift count is still masked to 0x1F for any operand size less than 64b. i.e. count %= 32
shr al, 16 ; so 8b and 16b shifts can zero registers.
# zeroing ah/bh/ch/dh: Low byte of the reg = whatever garbage was in the high16 reg
movzx eax, ah ; From Jerry Coffin's answer
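One subtlety with the 8-bit and 16-bit methods is that they clear only the partial register. The rest of eax is merged from its old value. By contrast, `movzx eax, ah` writes the full register. Here is a minimal Python model, with hypothetical helper names, assuming 32-bit registers:

```python
MASK32 = 0xFFFFFFFF


def merge_low(eax: int, value: int, bits: int) -> int:
    """An 8- or 16-bit write replaces only the low `bits` of eax."""
    low_mask = (1 << bits) - 1
    return (eax & ~low_mask & MASK32) | (value & low_mask)


def shr_al_16(eax: int) -> int:
    """shr al, 16: count 16 survives the 5-bit mask, so AL becomes 0."""
    al = eax & 0xFF
    return merge_low(eax, al >> (16 & 31), 8)


def movzx_eax_ah(eax: int) -> int:
    """movzx eax, ah: old AH is zero-extended into all 32 bits, so the new AH is 0."""
    return (eax >> 8) & 0xFF


eax = 0x1234ABCD
print(hex(shr_al_16(eax)))     # 0x1234ab00: only AL was cleared
print(hex(movzx_eax_ah(eax)))  # 0xab: old AH moved into AL, AH now 0
```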
Depending on other existing conditions (other than already having a zero in another register):
bextr eax, any, eax ; if al >= 32, or ah = 0. BMI1
BLSR eax, src ; if src only has one set bit
CDQ ; edx = sign-extend(eax): zeroes edx if eax is non-negative
sbb eax, eax ; if CF=0. (Only recognized on AMD CPUs as dependent only on flags (not eax))
setcc al ; with a condition that will produce a zero based on known state of flags
PSHUFB xmm0, all-ones ; xmm0 bytes are cleared when the mask bytes have their high bit set
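The preconditions in the comments can be checked with a small Python model of each instruction's result. The model uses 32-bit operand size and hypothetical helper names, and it ignores all flags except the CF input to sbb. For bextr, the control operand's low byte is the start bit and the next byte is the length:

```python
MASK32 = 0xFFFFFFFF


def bextr32(src: int, control: int) -> int:
    """BMI1 BEXTR: extract `length` bits of src starting at bit `start`."""
    start = control & 0xFF          # al when control is eax
    length = (control >> 8) & 0xFF  # ah when control is eax
    # start >= 32 shifts every bit out; length == 0 gives an empty mask
    return ((src & MASK32) >> start) & ((1 << length) - 1)


def blsr32(src: int) -> int:
    """BMI1 BLSR: clear the lowest set bit (src & (src - 1))."""
    src &= MASK32
    return src & (src - 1) & MASK32


def cdq(eax: int) -> int:
    """CDQ: EDX is filled with copies of EAX's sign bit."""
    return MASK32 if eax & 0x80000000 else 0


def sbb32(dst: int, src: int, cf: int) -> int:
    """SBB: dst - src - CF, truncated to 32 bits."""
    return (dst - src - cf) & MASK32


print(bextr32(0xDEADBEEF, 0x0820))  # 0: start 32 is past the top bit
print(blsr32(0x40))                 # 0: exactly one bit was set
print(cdq(0x7FFFFFFF))              # 0: EAX is non-negative
print(sbb32(0x1234, 0x1234, 0))     # 0: sbb eax, eax with CF = 0
```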
Some of these SSE2 integer instructions can also be used on MMX registers (mm0 - mm7). Again, the best choice is some form of xor: either PXOR / VPXOR, or XORPS / VXORPS.
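AVX vxorps xmm0,xmm0,xmm0 zeroes the full ymm0/zmm0, and is better than vxorps ymm0,ymm0,ymm0 on AMD CPUs. Each of these zeroing instructions has three encodings: legacy SSE, AVX (VEX prefix), and AVX512 (EVEX prefix). The SSE version only zeroes the bottom 128 bits, though, which isn't the full register on CPUs that support AVX or AVX512. Depending on how you count, each entry can therefore be three different instructions (the same opcode, just different prefixes). The exception is vzeroall, which AVX512 didn't change (and which doesn't zero zmm16-31).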
ANDNPD xmm0, xmm0
ANDNPS xmm0, xmm0
PANDN xmm0, xmm0 ; dest = ~dest & src
PCMPGTB xmm0, xmm0 ; n > n is always false.
PCMPGTW xmm0, xmm0 ; similarly, pcmpeqd is a good way to do _mm_set1_epi32(-1)
PCMPGTD xmm0, xmm0
PCMPGTQ xmm0, xmm0 ; SSE4.2, and slower than byte/word/dword
PSADBW xmm0, xmm0 ; sum of absolute differences
MPSADBW xmm0, xmm0, 0 ; SSE4.1. sum of absolute differences, register against itself with no offset. (imm8=0: same as PSADBW)
; shift-counts saturate and zero the reg, unlike for GP-register shifts
PSLLDQ xmm0, 16 ; left-shift the bytes in xmm0
PSRLDQ xmm0, 16 ; right-shift the bytes in xmm0
PSLLW xmm0, 16 ; left-shift the bits in each word
PSLLD xmm0, 32 ; double-word
PSLLQ xmm0, 64 ; quad-word
PSRLW/PSRLD/PSRLQ ; same but right shift
PSUBB/W/D/Q xmm0, xmm0 ; subtract packed elements, byte/word/dword/qword
PSUBSB/W xmm0, xmm0 ; sub with signed saturation
PSUBUSB/W xmm0, xmm0 ; sub with unsigned saturation
PXOR xmm0, xmm0
XORPD xmm0, xmm0
XORPS xmm0, xmm0
VZEROALL
# Can raise an exception on SNaN, so only usable if you know exceptions are masked
CMPLTPD xmm0, xmm0 # exception on QNaN or SNaN, or denormal
VCMPLT_OQPD xmm0, xmm0,xmm0 # exception only on SNaN or denormal
VCMPLT_OQPS xmm0, xmm0, xmm0 # ditto; the _OQ predicates exist only in the AVX (VEX) encoding
VCMPFALSE_OQPD xmm0, xmm0, xmm0 # This is really just another imm8 predicate value for the same VCMPPD xmm,xmm,xmm, imm8 instruction. Same exception behaviour as LT_OQ.
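A few of the vector entries can be checked with a simplified Python model that treats an xmm register as a list of 16 bytes or 8 words. The helper names are hypothetical, and PSADBW is reduced to its two per-qword sums. The model shows that n > n is always false, that the sum of absolute differences of a register with itself is 0, and that vector shift counts saturate instead of wrapping:

```python
def pcmpgtb(a: list, b: list) -> list:
    """Signed byte compare: 0xFF where a > b, else 0x00."""
    def s8(x: int) -> int:
        return x - 256 if x >= 128 else x
    return [0xFF if s8(x) > s8(y) else 0x00 for x, y in zip(a, b)]


def psadbw_sums(a: list, b: list) -> list:
    """Sum of |a - b| over each 8-byte half (each sum lands in one qword)."""
    return [sum(abs(x - y) for x, y in zip(a[i:i + 8], b[i:i + 8])) for i in (0, 8)]


def psllw(words: list, count: int) -> list:
    """Shift each 16-bit word left; counts above 15 zero every element."""
    if count > 15:
        return [0] * len(words)
    return [(w << count) & 0xFFFF for w in words]


xmm0 = list(range(120, 136))    # arbitrary byte values, some >= 0x80
print(pcmpgtb(xmm0, xmm0))      # sixteen 0s
print(psadbw_sums(xmm0, xmm0))  # [0, 0]
print(psllw([0xFFFF] * 8, 16))  # eight 0s
```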
SUBPS xmm0, xmm0 and similar instructions don't work, because NaN - NaN = NaN, not zero.
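Python floats follow IEEE 754 closely enough to illustrate this, although they don't model MXCSR exception flags, so this shows values only. x - x yields NaN for infinity and NaN inputs. A less-than compare of a value with itself is false for every input, including NaN, which is why the CMPLT forms produce an all-zero mask. The helper names are hypothetical:

```python
import math


def sub_self(x: float) -> float:
    """Per-element result of SUBPS xmm0, xmm0."""
    return x - x


def cmplt_self_mask(x: float) -> int:
    """Per-element result of CMPLTPS xmm0, xmm0: all-ones if x < x, else 0."""
    return 0xFFFFFFFF if x < x else 0


for x in (1.5, -0.0, math.inf, -math.inf, math.nan):
    print(x, sub_self(x), cmplt_self_mask(x))
```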
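Also, FP instructions can raise exceptions on NaN arguments. CMPPS/PD is therefore only safe if you know exceptions are masked and you don't care about possibly setting the exception bits in MXCSR. Even the AVX version, with its expanded choice of predicates, raises #IA on SNaN. The "quiet" predicates only suppress #IA for QNaN. CMPPS/PD can also raise the Denormal exception.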
(See the tables in the instruction-set reference entry for CMPPD, or preferably Intel's original PDF, since the HTML extract mangles that table.)
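For AVX512 there are probably several options, but I'm not curious enough right now to dig through the instruction-set list looking for all of them.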
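One interesting one is worth mentioning, though: with imm8 = 0xFF, VPTERNLOGD/Q can instead set a register to all-ones. (On current implementations it has a false dependency on the old value.) Since the compare instructions all compare into a mask register, VPTERNLOGD seemed to be the best way to set a vector to all-ones on Skylake-AVX512 in my testing. This holds even though it doesn't special-case imm8 = 0xFF to avoid the false dependency.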
VPTERNLOGD zmm0, zmm0,zmm0, 0 ; inputs can be any registers you like.
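VPTERNLOGD uses imm8 as a 3-input truth table. For each bit position, the bits from the destination, second and third operands form a 3-bit index that selects one bit of imm8. A minimal Python model of one 32-bit element (hypothetical helper name) shows why imm8 = 0 always yields 0 and imm8 = 0xFF always yields all-ones, whatever the inputs:

```python
def vpternlogd_element(dst: int, src2: int, src3: int, imm8: int) -> int:
    """One dword element of VPTERNLOGD: bit i = imm8[(dst_i << 2) | (src2_i << 1) | src3_i]."""
    result = 0
    for i in range(32):
        index = (((dst >> i) & 1) << 2) | (((src2 >> i) & 1) << 1) | ((src3 >> i) & 1)
        result |= ((imm8 >> index) & 1) << i
    return result


a, b, c = 0x12345678, 0x0F0F0F0F, 0xFFFF0000
print(vpternlogd_element(a, b, c, 0x00))               # 0
print(hex(vpternlogd_element(a, b, c, 0xFF)))          # 0xffffffff
print(vpternlogd_element(a, b, c, 0x96) == a ^ b ^ c)  # True: 0x96 is 3-way XOR
```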
For x87 FP there is only one choice (because sub doesn't work if the old value is infinity or NaN):
FLDZ ; push +0.0