jef*_*son 803 c++ optimization performance x86 assembly
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement the same brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with
g++ p14.cpp -o p14
The assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
The C++, p14.cpp:
#include <iostream>
using namespace std;
int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    cout << maxi << endl;
}
I know about compiler optimizations improving speed and all that, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).
The C++ code uses modulus on every term and division on every even term, while the assembly only does a single division per even term.
But the assembly is on average about 1 second slower than the C++ solution. Why is this? I'm asking mainly out of curiosity.
My system: 64-bit Linux on a 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
Timing results:
g++ (no optimization): avg 1272 ms
g++ -O3: avg 578 ms
Original asm (with div): avg 2650 ms
Asm (with shr): avg 679 ms
@johnfound asm (assembled with NASM): avg 501 ms
@hidefromkgb asm: avg 200 ms
@hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms
@Veedrac C++: avg 81 ms with -O3, 305 ms with -O0
Pet*_*des 1848
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0
(compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
It's usually fine to just let the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. a shift: idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and a modulus with the same operands, if they use IDIV at all.)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division by a constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative-inverse code is only slightly larger but much faster.
(Summary for this case: use uint64_t n.)
First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extensions or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black box for the optimizer, so constant propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory, which adds overhead.)
In this case: you made n a signed type, and gcc uses a SAR/SHR/ADD sequence to give the correct rounding. (IDIV and arithmetic shift "round" differently for negative inputs; see the SAR entry in the insn set ref manual.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed overflow is undefined behaviour, so it should have been able to.)
You should use uint64_t n, so it can just SHR. It's then also portable to systems where long is only 32-bit (e.g. x86-64 Windows).
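For illustration, a minimal sketch of the question's sequence function with unsigned 64-bit n (my own example; the logic is otherwise unchanged), so that n / 2 can compile to a plain SHR:

#include <cstdint>

// Same brute-force loop as the question's sequence(), but with unsigned
// 64-bit n: n / 2 is then a simple logical shift right, with no extra
// rounding fixup needed for negative values.
static int sequence(std::uint64_t n) {
    int count = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++count;
    }
    return count;
}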
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
the 3-component LEA (3 cycles)
the CMOV (2 cycles on Haswell, 1c on Broadwell and later)
Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see whether it really runs at 5c/iter).
The FLAGS input of cmov
(produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n
values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const]
, which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead/throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main
, but that's fine because each thread can just check a range of n
values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max
/maxi
. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n
values have reached 1
before getting another pair of starting n
values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
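As a rough sketch of the simpler of those two strategies (my own illustration, not from the answer: step two chains in the same loop, keep a finished chain parked at 1, and conditionally increment its counter):

#include <cstdint>

// Two independent Collatz chains stepped in the same loop iteration, so their
// loop-carried dependency chains can overlap in the out-of-order window.
// A chain that reaches 1 is held at 1 and its counter stops (the conditional
// increment mentioned above); a new pair is fetched only when both are done.
static void collatz_pair(std::uint64_t n0, std::uint64_t n1, int& c0, int& c1) {
    c0 = 1; c1 = 1;                 // same counting convention as the question's C++
    while (n0 != 1 || n1 != 1) {
        std::uint64_t s0 = (n0 & 1) ? 3 * n0 + 1 : n0 / 2;
        std::uint64_t s1 = (n1 & 1) ? 3 * n1 + 1 : n1 / 2;
        c0 += (n0 != 1);  n0 = (n0 != 1) ? s0 : 1;
        c1 += (n1 != 1);  n1 = (n1 != 1) ? s1 : 1;
    }
}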
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n
hadn't reached 1
yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n
values up in the air. Maybe only worth with 256b vector (4x uint64_t
).
I think the best strategy to make detection of a 1
"sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1
in an element, the increment-vector will have a zero, and +=0 is a no-op.
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, ymm0
vpaddq ymm1, ymm1, ymm0
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift the low bit (n&1) up to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# There may be a better way to do this blend, avoiding the bypass delay for an FP blend between integer insns, not sure. Probably worth it
vpblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 ymm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics, instead of hand-written asm.
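A hedged sketch of that idea with AVX2 intrinsics (my own illustration, mirroring the untested asm brain-dump above rather than a tuned implementation; the function name collatz4 and the exact count convention are assumptions):

#include <immintrin.h>
#include <cstdint>

// Process four 64-bit n values per __m256i, counting steps until every lane
// has reached 1. "Reached 1" is made sticky by zeroing that lane's increment,
// so += 0 becomes a no-op, as described above.
static void collatz4(const std::uint64_t n_in[4], std::uint64_t count_out[4]) {
    __m256i n   = _mm256_loadu_si256((const __m256i*)n_in);
    __m256i one = _mm256_set1_epi64x(1);
    __m256i inc = _mm256_andnot_si256(_mm256_cmpeq_epi64(n, one), one); // 0 for lanes already at 1
    __m256i cnt = _mm256_setzero_si256();                               // per-lane step counts

    while (!_mm256_testz_si256(inc, inc)) {   // loop until every lane's increment is zero
        __m256i odd  = _mm256_add_epi64(_mm256_add_epi64(_mm256_add_epi64(n, n), n), one); // 3n+1
        __m256i even = _mm256_srli_epi64(n, 1);                                            // n/2
        __m256i sign = _mm256_slli_epi64(n, 63);       // move n&1 into the sign bit
        // blend on the sign bit: pick 3n+1 for odd lanes, n/2 for even lanes
        n = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_castsi256_pd(even),
                                                 _mm256_castsi256_pd(odd),
                                                 _mm256_castsi256_pd(sign)));
        __m256i done = _mm256_cmpeq_epi64(n, one);     // lanes that just reached 1
        inc = _mm256_andnot_si256(done, inc);          // sticky: stop incrementing them
        cnt = _mm256_add_epi64(cnt, inc);              // count++ in lanes still running
    }
    _mm256_storeu_si256((__m256i*)count_out, cnt);
}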
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing, because no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar n's in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry;  // C++ structured like the asm, for illustration only
do {
    n = n*3 + 1;
  loop_entry:
    shift = _tzcnt_u64(n);
    n >>= shift;
    count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed/power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop/1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1
, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n
can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF
, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll
even with -mno-bmi
.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed; e.g. Windows 9x assumes no speculative prefetching of TLB entries, which was safe when the code was written.)
joh*_*und 98
Claiming that the C++ compiler can produce more optimal code than a competent assembly-language programmer is a very bad mistake. Especially in this case. A human can always make the code better than the compiler can, and this particular situation is a good illustration of that claim.
The timing difference you see is because the assembly code in the question is very far from optimal in the inner loops.
(The code below is 32-bit, but can easily be converted to 64-bit.)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests (on a 1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 at i = 837799.
gna*_*729 21
For more performance: a simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you can save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
    n = (3*n + 1) / 2;
    if (n % 2 == 0) {
        do n /= 2; while (n % 2 == 0);
        if (n == 1) break;
    }
}
And here's a big win: if you look at the lowest 8 bits of n, all the steps until you have divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001, then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar happens for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;

switch (m) {
    case 0: n = 1 * k + 0; break;
    case 1: n = 81 * k + 1; break;
    case 2: n = 81 * k + 2; break;
    ...
    case 155: n = 729 * k + 445; break;
    ...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue with the "normal" loop - or prepare a table that tells you how many more steps are needed to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... };
static const unsigned int adders [256] = { ... };

while (n > 128) {
    size_t lastBits = n % 256;
    n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, or 12 bits of n at a time is faster. For each additional bit, the number of entries in each table doubles, and I expect a slowdown once the tables no longer fit into L1 cache.
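As an illustration of how such tables could be filled in for the 8-bit case (my own sketch; build_tables is a hypothetical helper matching the multipliers/adders arrays above):

#include <cstddef>

// For each possible low byte m, apply eight combined steps of the redefined
// rule (odd: n -> (3n+1)/2, even: n -> n/2) to the low part. The high part
// k = n >> 8 just picks up a factor of 3 for every odd step, so afterwards
// n' = multipliers[m] * k + adders[m].
static void build_tables(unsigned int multipliers[256], unsigned int adders[256]) {
    for (std::size_t m = 0; m < 256; ++m) {
        unsigned int mult = 1;    // 3^(number of odd steps)
        unsigned int add  = m;    // what the low byte turns into
        for (int step = 0; step < 8; ++step) {
            if (add % 2) {
                mult *= 3;
                add = (3 * add + 1) / 2;
            } else {
                add /= 2;
            }
        }
        multipliers[m] = mult;
        adders[m] = add;
    }
}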
PPS. If you need the number of operations: in each iteration we do exactly eight divisions by two and a variable number of (3n + 1) operations, so an obvious way to count the operations would be another array. But we can actually calculate the number of steps (based on the number of iterations of the loop).
We could redefine the problem slightly: replace n with (3n + 1) / 2 if it's odd, and replace n with n / 2 if it's even. Then every iteration does exactly eight steps, but you could consider that cheating :-) So assume there were r operations n ← 3n + 1 and s operations n ← n / 2. The result will be quite exactly n' = n * 3^r / 2^s, because n ← 3n + 1 means n ← 3n * (1 + 1/(3n)). Taking the logarithm we find r = (s + log2(n'/n)) / log2(3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table of how many iterations are needed from any starting point n ≤ 1,000,000, then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
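A rough sketch of that last calculation (my own illustration; it only recovers r from n, n', and s, and leaves the precomputed tail table out):

#include <cmath>
#include <cstdint>

// From n' = n * 3^r / 2^s we get r = (s + log2(n'/n)) / log2(3). The number of
// operations performed by the unrolled loop itself is then r + s (r odd steps
// plus s halvings); the remaining steps for n' come from the precomputed table.
static std::uint64_t ops_in_unrolled_loop(std::uint64_t n, std::uint64_t n_prime, std::uint64_t s) {
    double r = (static_cast<double>(s) +
                std::log2(static_cast<double>(n_prime) / static_cast<double>(n))) / std::log2(3.0);
    return s + static_cast<std::uint64_t>(std::llround(r));
}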
hid*_*kgb 18
On a rather unrelated note: more performance hacks!
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[Even] [Odd]
[Odd] [Even]
[Even] [Even]
To leap past these 2 elements means computing (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most to least significant, then a = 1, and the following holds:
(N << 1) + N + 1: (N >> 1) + N + 1:
b10 b1
b1 b
+ 1 + 1
---- ---
bBb0 bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
QED: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
#include <stdio.h>
#include <stdint.h>

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;
    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;
    maxi = sequence(1000000, &maxc);
    printf("%llu, %llu\n", maxi, maxc);
    return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
Let's translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
@main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
@loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA @loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA @main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
@itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE @itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE @itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
@outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE @outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Compile with the following commands:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (Editor's note: sorry for putting my stuff in your answer, but my answer hit the 30k char limit from the Godbolt links + text!)
C++ programs are translated into assembly during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code that is more optimal and efficient than a dumb assembler's code.
However, I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
From the comments:
But, this code never stops (because of integer overflow)!?! Yves Daoust
For many numbers it will not overflow.
If it does overflow - for one of those unlucky initial seeds - the overflowed number will very likely converge toward 1 without another overflow.
Still, this poses an interesting question: is there some overflow-cyclic seed number?
Any simple final converging series starts with a power-of-two value (obvious enough?).
2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only with 1), but the most optimal solution in the answer will finish thanks to shr rax producing ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, so the next number is 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically that's an undefined state of the algorithm, but johnfound's optimized answer will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheaply" through the undefined 0 value).
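A tiny check of that claim (my own snippet), relying on unsigned 64-bit wrap-around:

#include <cstdint>
#include <cstdio>

// 0x5555555555555555 is (2^64 - 1) / 3, so 3*n is 0xFFFFFFFFFFFFFFFF and
// 3*n + 1 wraps to 0 in 64-bit unsigned arithmetic.
int main() {
    std::uint64_t n = 0x5555555555555555ULL;
    std::uint64_t next = 3 * n + 1;   // wraps modulo 2^64
    std::printf("%llu\n", static_cast<unsigned long long>(next));  // prints 0
    return 0;
}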
How about some more complex number, one that would create a cycle without 0? Frankly, I'm not sure; my math theory is too hazy to get any serious idea how to deal with it seriously. But intuitively I would say the series converges to 1 for every number 0 < number, because the 3n+1 formula will, sooner or later, slowly turn every non-2 prime factor of the original number (or an intermediate one) into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.
So I just put a few numbers into a sheet and took a look at 8-bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).
But there's no value creating a cyclic overflow seed.
Funnily enough I did a check for the first number to suffer from 8-bit truncation, and already 27 is affected! It does reach the value 9232 in the proper non-truncated series (the first truncated value is 322, in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself); the maximum number of steps to converge to 1 is about 128 (+-2, not sure whether "1" is to be counted, etc...).
Interestingly enough (to me), the number 9232 is the maximum for many other source numbers; what's so special about it? :-O 9232 = 0x2410 ... hmm.. no idea.
Unfortunately I can't get any deep grasp of this series - why it converges, and what the implications of truncating it to k bits are - but with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value ending as 0 after truncation.
But the value 27 overflowing in the 8-bit case is sort of alarming: it looks like if you count the number of steps to reach the value 1, you will get a wrong result for the majority of numbers from the total k-bit set of integers. For 8-bit integers, 146 of the 256 numbers have their series affected by truncation (some of them may still hit the correct number of steps by accident, maybe; I was too lazy to check).
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come out expensive.
The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV. That, of course, has a zero percent chance of being mispredicted.
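In C++ terms, the branch-free shape the compiler likely produces looks something like this (a sketch, not the actual compiler output):

#include <cstdint>

// Compute both successors unconditionally, then select; with optimization
// this typically becomes a CMOV instead of a hard-to-predict branch.
static inline std::uint64_t collatz_step(std::uint64_t n) {
    std::uint64_t odd  = 3 * n + 1;   // cheap next to the latency of a div/mod
    std::uint64_t even = n / 2;
    return (n & 1) ? odd : even;
}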
For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop

    l = [ ]
    stop = False
    n = N

    tails = [ ]

    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        # remember (value, position) so the tail starting at each value can be cached
        tails.append((tmp, len(l)))

    for key, offset in tails:
        if not key in cache:
            cache[key] = l[offset:]

    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            return  # don't raise StopIteration inside a generator (PEP 479)

if __name__ == "__main__":
    le_cache = {}

    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))

    print("inner_loop = {}".format(inner_loop))