C++ code for testing the Collatz conjecture faster than hand-written assembly - why?

jef*_*son 803 c++ optimization performance x86 assembly

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They are the same identical brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with

nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ was compiled with

g++ p14.cpp -o p14

Assembly, p14.asm

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++, p14.cpp

#include <iostream>

using namespace std;

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = n*3 + 1;

        ++count;
    }

    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }

    cout << maxi << endl;
}

I know about compiler optimizations to improve speed and everything, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).

The C++ code computes a modulus every term and a division every even term, while the assembly does only a single division per even term.

But the assembly is taking on average a second longer than the C++ solution. Why is this? I'm asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on a 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

Pet*_*des 1848

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that it can compile efficiently. Also see Is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks.


even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible compared to shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.


As you can see from gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and a modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division by a constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non-power of 2) to avoid IDIV (see div_by_13 in the godbolt link above).

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative-inverse code is only slightly larger but much faster.


Helping the compiler

(Summary for this case: use uint64_t n.)

First of all, it's only interesting to look at optimized compiler output. (-O3.) -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are affected, too. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory, which adds overhead.)

In this case: you made n a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic shift "round" differently for negative inputs; see the SAR insn set ref manual entry.) (IDK whether gcc tried and failed to prove that n can't be negative, or what. Signed overflow is undefined behaviour, so it should have been able to.)

You should use uint64_t n, so it can just SHR. That also makes it portable to systems where long is only 32 bits (e.g. x86-64 Windows).
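For reference, a minimal sketch of that fix (my packaging of the question's code; the only real change is the type of n, and the search is wrapped in a function):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// The question's brute force, with the one change recommended above:
// n is uint64_t, so n /= 2 can compile to a single SHR.
int sequence(uint64_t n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;            // unsigned: a plain right shift
        else
            n = n * 3 + 1;
        ++count;
    }
    return count;
}

// The same search as the question's main(); returns {maxi, max}.
std::pair<int, int> longest_chain() {
    int max = 0, maxi = 0;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    return {maxi, max};
}
```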


BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                   # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi         # rdi = n>>1;
    test   al, 1       # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1      # ++count;
    cmp    rax, 1
    jne   .L9          #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

  • 3-component LEA (3 cycles)
  • cmov (2 cycles on Haswell, 1c on Broadwell and later).

Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.


Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test/cmovz.

 ### Hand-optimized version of what gcc does
.L9:                       #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1         # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
    inc     edx            # ++count;
    cmp     rax, 1
    jne     .L9            #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).

So lea rcx, [rax + rax*2]/inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)


Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead/throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max/maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
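The two-chains-in-lockstep idea can be sketched in scalar C++ (my own illustration, not code from the answer; a finished chain parks at 1 so its updates become no-ops):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// One Collatz step, except that a finished chain (n == 1) stays parked at 1.
static inline uint64_t advance_chain(uint64_t n) {
    return (n == 1) ? 1 : (n & 1) ? 3 * n + 1 : n / 2;
}

// Advance two independent chains in lockstep, so out-of-order execution can
// overlap their loop-carried dependency chains. Returns both term counts
// (the question's convention: count starts at 1).
std::pair<int, int> sequence_pair(uint64_t a, uint64_t b) {
    int ca = 1, cb = 1;
    while (a != 1 || b != 1) {
        ca += (a != 1);        // only chains still in flight keep counting
        cb += (b != 1);
        a = advance_chain(a);
        b = advance_chain(b);
    }
    return {ca, cb};
}
```

A real implementation would fetch a fresh starting value into a finished slot instead of letting it idle, as the paragraph above discusses.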


You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, ymm0
    vpaddq    ymm1, ymm1, ymm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1= 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift the low bit to the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # There may be a better way to do this blend, avoiding the bypass delay for an FP blend between integer insns, not sure.  Probably worth it
    vpblendvpd ymm0, ymm0, ymm1, ymm3       # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 xmm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics, instead of hand-written asm.
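As a stepping stone toward intrinsics, the sticky-increment logic can be prototyped with 4 scalar "lanes" in plain arrays (my own sketch; each array would map to one YMM register):

```cpp
#include <cassert>
#include <cstdint>

// 4 scalar "lanes": n[i] is one element, inc[] plays the role of the
// increment vector. Once a lane's n reaches 1, its increment is zeroed
// permanently ("sticky"), so later count[i] += inc[i] additions are no-ops.
// count[] should be initialized to 1 per lane for the question's convention.
void collatz4(uint64_t n[4], uint64_t count[4]) {
    uint64_t inc[4];
    for (int i = 0; i < 4; ++i)
        inc[i] = (n[i] != 1);
    for (;;) {
        uint64_t any = 0;
        for (int i = 0; i < 4; ++i) {
            if (n[i] != 1)                    // parked lanes stay at 1
                n[i] = (n[i] & 1) ? 3 * n[i] + 1 : n[i] >> 1;
            count[i] += inc[i];               // counts the step just taken
            if (n[i] == 1) inc[i] = 0;        // sticky: lane is finished
            any |= inc[i];
        }
        if (!any) return;  // all-zero increment vector (the VPTEST in the sketch)
    }
}
```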


Algorithmic/implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing, because no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
   n = n*3 + 1;
  loop_entry:
   shift = _tzcnt_u64(n);
   n >>= shift;
   count += shift;
} while(n != 1);
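A compilable variant of that loop (my sketch), using GCC/Clang's __builtin_ctzll - which compilers can lower to TZCNT or BSF - and keeping the question's term-counting convention:

```cpp
#include <cassert>
#include <cstdint>

// Collapse each run of trailing zeros into one variable-count shift.
// __builtin_ctzll compiles to TZCNT (or BSF); n is never 0 here.
int sequence_ctz(uint64_t n) {
    int count = 1;
    while (n != 1) {
        if (n & 1) {
            n = 3 * n + 1;                // result is always even
            ++count;
        }
        int shift = __builtin_ctzll(n);   // >= 1, since n is even here
        n >>= shift;
        count += shift;
    }
    return count;
}
```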

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed/power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop/1c latency).

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written,

  • What amazes me about these incredible answers is the knowledge on display in the details. I'll never know a language or a system at that level, and I don't know how one would. Well done, sir. (77 upvotes)
  • Legendary answer!! (6 upvotes)
  • I tried the vectorized approach some time ago and it didn't help (because you can do much better in scalar code with `tzcnt`, and with vectorization you're locked to the longest-running sequence among your vector elements). (4 upvotes)
  • @jefferson The best I managed is https://godbolt.org/g/1N70Ib. I was hoping for something cleverer, but it seems not. (4 upvotes)
  • @EOF: no, I meant breaking out of the inner loop when any *one* of the vector elements hits `1`, instead of when they *all* have (easily detectable with PCMPEQ/PMOVMSK). Then you use PINSRQ and friends to fiddle with the one element that terminated (and its counter), and jump back into the loop. That can easily turn into a loss when you break out of the inner loop too often, but it does mean you always get 2 or 4 elements of useful work done on every iteration of the inner loop. Good point about memoization, though. (3 upvotes)
  • @PeterCordes I'm not sure you've fully understood my suggestion; it's considerably faster than CMOV. Try this: https://godbolt.org/g/8qggOm. (2 upvotes)
  • @PeterCordes The jumps do look wasteful, but they don't seem to cost much. You can remove them to check their cost - you'll get wrong results from the correct code path. I found the difference to be a few percent; the branch predictor seems to be doing its job. (2 upvotes)
  • @Veedrac: Celeron/Pentium (even Skylake Celeron) don't support any VEX-encoded instructions (no AVX, no BMI1/2). Since they don't support all of BMI1, they may even omit the BMI1 instructions that don't need a VEX prefix (like TZCNT). Even if TZCNT were supported, `-march=native` certainly wouldn't enable it, because the CPU doesn't report BMI1 support. Your best bet is BSF, which has intrinsics. It's identical to TZCNT when the input is non-zero. (Except that it sets different flags, and is slower on AMD.) There's __builtin_ctz, as well as an Intel intrinsic for BSF. (2 upvotes)
  • @Jefferson: compiling with `-mbmi1` tells the compiler it may use TZCNT. Even if your CPU doesn't support it, it runs as `rep bsf` and gives the same answer for non-zero inputs. However, that also tells the compiler it may use BLSI, which your CPU doesn't support. Probably the easiest thing to test with is actually LZCNT, because it produces a different result from BSR even in the non-zero case. (But it still decodes as REP BSR on CPUs that don't recognize it.) BSF and BSR leave dest = undefined for input = 0; the actual behaviour on Intel is dest = unmodified. (2 upvotes)
  • @TED I didn't write the assembly trying to beat the compiler; I wrote it for the fun of it. Most of the PE problems I've solved in asm didn't involve division and were already faster than my C solutions; I asked about this one because of its simple nature and long (brute-force) runtime. (2 upvotes)
  • @jefferson: no, I don't do this for work; I just think it's cool. (I would be interested in doing it professionally, though. These days I spend my time writing SO answers and developing open-source software.) BTW, when I was learning asm, I mostly looked at compiler output to see how it did things, and used that as a starting point to see whether I could spot different ways to do it. That's definitely one way to learn good idioms (and the occasional really neat transformation for the problem you're working on, like the one I described in the comments about the single-iteration loop in my last update). (2 upvotes)
  • @PeterCordes The most detailed answer I've come across; very well explained, sir. (2 upvotes)
  • @csch: thanks. I'm glad so many people got something out of what I wrote. I'm quite proud of it, and think it does a good job of explaining some optimization basics along with the specific details relevant to this problem. (2 upvotes)
  • @PeterCordes It really did help and inspire me. When I wrote this question I was messing around learning assembly by redoing old problems, knowing almost nothing. Before long I was able to write faster-running programs than my C solutions for all of the first 100 problems, and even some newer 500-level ones. I found [problem 571](https://projecteuler.net/problem=571) especially interesting to optimize in asm. I wish I had picked that one to open the question with, lol - although the simplicity of problem 14 certainly helped the question take off. (2 upvotes)
  • @nvmd: another good starting point is looking at compiler output; see [How to remove "noise" from GCC/clang assembly output?](/sf/ask/2698648151/), the Matt Godbolt talk linked there, and the rest of that Q&A. (2 upvotes)

  • joh*_*und 98

    Claiming that the C++ compiler can produce more optimal code than a competent assembly-language programmer is a very bad mistake. And especially in this case. A human can always make the code better than the compiler can, and this particular situation is a good illustration of that claim.

    The timing difference you see is because the assembly code in the question is very far from optimal in the inner loops.

    (The code below is 32-bit, but can easily be converted to 64-bit.)

    For example, the sequence function can be optimized to only 5 instructions:

        .seq:
            inc     esi                 ; counter
            lea     edx, [3*eax+1]      ; edx = 3*n+1
            shr     eax, 1              ; eax = n/2
            cmovc   eax, edx            ; if CF eax = edx
            jnz     .seq                ; jmp if n<>1
    

    The whole code looks like:

    include "%lib%/freshlib.inc"
    @BinaryType console, compact
    options.DebugMode = 1
    include "%lib%/freshlib.asm"
    
    start:
            InitializeAll
            mov ecx, 999999
            xor edi, edi        ; max
            xor ebx, ebx        ; max i
    
        .main_loop:
    
            xor     esi, esi
            mov     eax, ecx
    
        .seq:
            inc     esi                 ; counter
            lea     edx, [3*eax+1]      ; edx = 3*n+1
            shr     eax, 1              ; eax = n/2
            cmovc   eax, edx            ; if CF eax = edx
            jnz     .seq                ; jmp if n<>1
    
            cmp     edi, esi
            cmovb   edi, esi
            cmovb   ebx, ecx
    
            dec     ecx
            jnz     .main_loop
    
            OutputValue "Max sequence: ", edi, 10, -1
            OutputValue "Max index: ", ebx, 10, -1
    
            FinalizeAll
            stdcall TerminateAll, 0
    

    To compile this code, FreshLib is needed.

    In my tests (on a 1 GHz AMD A4-1200 processor), the above code is about four times faster than the C++ code from the question (compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

    The output of both programs is the same: max sequence = 525 on i = 837799.

    • "*Claiming that the C++ compiler is better is a very bad mistake. Especially in this case. A human can always make the code better, and this particular problem is a good illustration of that claim.*" You could invert it and it would work just as well: "*Claiming that a **human** is better is a very bad mistake. Especially in this case. A human can always make the code **worse**, and this particular **question** is a good illustration of that claim.*" So I don't think you have a point here; such generalizations are wrong. (93 upvotes)
    • @luk32: a skilled person can (and usually should) start from compiler output. So as long as you benchmark your attempts to make sure they're actually faster (on the target hardware you're tuning for), you can't do worse than the compiler. But I have to agree it's a rather strong statement. Compilers are usually far better than novice asm coders. But it's usually possible to save an instruction or two compared with what compilers come up with. (Not always on the critical path, though, depending on the uarch.) They're highly useful pieces of complex machinery, but they're not "smart". (29 upvotes)
    • Hmm, that's clever. SHR sets ZF only if EAX was 1 (or 0). I missed that when optimizing gcc's `-O3` output, but I did spot all the other optimizations you made to the inner loop. (But why LEA for the increment instead of INC? It's fine to clobber flags at that point, and it causes a slowdown on everything except maybe P4 (false dependency on the old flags for both INC and SHR). LEA can't run on as many ports, and could cause resource conflicts that delay the critical path more often.) (6 upvotes)
    • @luk32 - But the author of the question can't really make any argument at all, since his knowledge of assembly language is close to zero. Every argument about humans vs. compilers implicitly assumes the human has at least some medium level of knowledge. More: the theorem "code written by a human will always be better than or equal to compiler-generated code" is very easy to formally prove. (5 upvotes)
    • Oh, actually Bulldozer might bottleneck on throughput with the compiler's output. It has lower-latency CMOV and 3-component LEA than Haswell (which I was thinking of), so the loop-carried dep chain is only 3 cycles in your code. It also has no zero-latency MOV for integer registers, so g++'s wasted MOV instructions actually increase the latency of the critical path, and are a big deal on Bulldozer. So yes, hand optimization really does beat the compiler in a significant way for CPUs that aren't modern enough to chew through the useless instructions. (4 upvotes)

    gna*_*729 21

    For more performance: a simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:

    while (n % 2 == 0) n /= 2;
    if (n > 1) for (;;) {
        n = (3*n + 1) / 2;
        if (n % 2 == 0) {
            do n /= 2; while (n % 2 == 0);
            if (n == 1) break;
        }
    }
    
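With step counting added (my instrumentation, using the question's convention of counting terms starting at 1), the restructured loop can be checked against the naive one:

```cpp
#include <cassert>
#include <cstdint>

// The restructured loop above, instrumented with a term count. After 3n+1
// the value is always even, so the combined n = (3n+1)/2 advances two
// steps at once, hence count += 2.
int sequence_restructured(uint64_t n) {
    int count = 1;
    while (n % 2 == 0) { n /= 2; ++count; }
    if (n > 1) for (;;) {
        n = (3 * n + 1) / 2;
        count += 2;
        if (n % 2 == 0) {
            do { n /= 2; ++count; } while (n % 2 == 0);
            if (n == 1) break;
        }
    }
    return count;
}

// Naive reference, as in the question.
int sequence_naive(uint64_t n) {
    int count = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++count;
    }
    return count;
}
```

Comparing the two for every start below a few thousand catches any step-counting mistakes.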

    Here's the big win: If you look at the lowest 8 bits of n, all the steps until you've divided by 2 eight times are completely determined by those eight bits. For example, if the last 8 bits are 0x01, then in binary your number is ???? 0000 0001, and the next steps are:

    3n+1 -> ???? 0000 0100
    / 2  -> ???? ?000 0010
    / 2  -> ???? ??00 0001
    3n+1 -> ???? ??00 0100
    / 2  -> ???? ???0 0010
    / 2  -> ???? ???? 0001
    3n+1 -> ???? ???? 0100
    / 2  -> ???? ???? ?010
    / 2  -> ???? ???? ??01
    3n+1 -> ???? ???? ??00
    / 2  -> ???? ???? ???0
    / 2  -> ???? ???? ????
    

    So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:

    k = n / 256;
    m = n % 256;
    
    switch (m) {
        case 0: n = 1 * k + 0; break;
        case 1: n = 81 * k + 1; break; 
        case 2: n = 81 * k + 1; break; 
        ...
        case 155: n = 729 * k + 425; break;
        ...
    }
    

    Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue with the "normal" loop - or prepare a table telling you how many more steps are needed to reach 1.

    PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like:

    static const unsigned int multipliers [256] = { ... }
    static const unsigned int adders [256] = { ... }
    
    while (n > 128) {
        size_t lastBits = n % 256;
        n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
    }
    

    In practice, you would measure whether processing the last 9, 10, 11 or 12 bits of n at a time would be faster. For each additional bit, the number of entries in the table doubles, and I'd expect a slowdown when the tables no longer fit into L1 cache.
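The tables don't have to be written by hand; they can be generated mechanically (my sketch, using the table names from the snippet above): track n = c*k + d symbolically, starting from c = 256. While c is even, the parity of c*k + d equals the parity of d, so each branch is determined; stop once the eight halvings are done, i.e. when c has become odd (some power of 3).

```cpp
#include <cassert>
#include <cstdint>

// For each low byte m, run Collatz steps on n = 256*k + m symbolically
// as c*k + d, until c (initially 256) has been halved eight times.
void build_tables(uint64_t multipliers[256], uint64_t adders[256]) {
    for (int m = 0; m < 256; ++m) {
        uint64_t c = 256, d = (uint64_t)m;
        while (c % 2 == 0) {
            if (d % 2) {         // odd: n -> 3n + 1
                c *= 3;
                d = 3 * d + 1;
            } else {             // even: n -> n / 2 (c and d both even)
                c /= 2;
                d /= 2;
            }
        }
        multipliers[m] = c;      // some power of 3
        adders[m] = d;
    }
}
```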

    PPS. If you need the number of operations: in each iteration we do exactly eight divisions by two and a variable number of (3n + 1) operations, so an obvious way to count the operations would be another array. But we can actually calculate the number of steps, based on the number of iterations of the loop.

    We could redefine the problem slightly: replace n with (3n + 1) / 2 if odd, and with n / 2 if even. Then every iteration does exactly eight steps - but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite precisely n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm, we find r = (s + log2(n'/n)) / log2(3).

    If we do the loop until n ≤ 1,000,000, with a precomputed table of how many iterations are needed from any starting point n ≤ 1,000,000, then calculating r as above, rounded to the nearest integer, will give the right result unless s is really large.
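A quick numeric check of that formula (my own illustration): count r and s directly for a few starting values, then recover r from s and n' = 1 via the logarithm:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Run the sequence for n0, counting r (3n+1 steps) and s (halvings),
// then recover r from s via r = (s + log2(n'/n)) / log2(3), with n' = 1.
bool formula_matches(uint64_t n0) {
    uint64_t n = n0;
    int r = 0, s = 0;
    while (n != 1) {
        if (n & 1) { n = 3 * n + 1; ++r; }
        else       { n /= 2;        ++s; }
    }
    double est = (s + std::log2(1.0 / (double)n0)) / std::log2(3.0);
    return (int)std::lround(est) == r;
}
```

For n0 = 27, r = 41 and s = 70, and the estimate (70 - log2 27) / log2 3 ≈ 41.17 rounds to 41.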

    • Or make data lookup tables for the multiply and add constants instead of a switch. Indexing two 256-entry tables is faster than a jump table, and the compilers are probably not looking for that transformation. (2 upvotes)

    hid*_*kgb 18

    On a rather unrelated note: more performance hacks!

    • [The first «conjecture» has finally been debunked by @ShreevatsaR; removed]

    • When traversing the sequence, we can get only 3 possible cases in the 2-neighborhood of the current element N (shown first):

      1. [even] [odd]
      2. [odd] [even]
      3. [even] [even]

      To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.

      Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.

      Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most to least significant, then a = 1, and the following holds:

      (N << 1) + N + 1:     (N >> 1) + N + 1:
      
              b10                    b1
               b1                     b
             +  1                   + 1
             ----                   ---
             bBb0                   bBb
      

      where B = !b. Right-shifting the first result gives us exactly what we want.

      QED: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.

      As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.

    The resulting algorithm looks like this:

    #include <stdint.h>
    #include <stdio.h>

    uint64_t sequence(uint64_t size, uint64_t *path) {
        uint64_t n, i, c, maxi = 0, maxc = 0;
    
        for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
            c = 2;
            while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
                c += 2;
            if (n == 2)
                c++;
            if (c > maxc) {
                maxi = i;
                maxc = c;
            }
        }
        *path = maxc;
        return maxi;
    }
    
    int main() {
        uint64_t maxi, maxc;
    
        maxi = sequence(1000000, &maxc);
        printf("%llu, %llu\n", maxi, maxc);
        return 0;
    }
    

    Here we compare with n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
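The two-elements-at-a-time formula is easy to verify exhaustively against single steps for small N (my sketch):

```cpp
#include <cassert>
#include <cstdint>

// One ordinary Collatz step.
static uint64_t step(uint64_t n) {
    return (n & 1) ? 3 * n + 1 : n / 2;
}

// Two steps at once, as in the answer's loop: cases [even,odd] and
// [odd,even] share (N >> 1) + N + 1; [even,even] is N >> 2.
static uint64_t step2(uint64_t n) {
    return (n & 3) ? (n >> 1) + n + 1 : (n >> 2);
}

// Exhaustively compare step2 with two single steps for all n below limit.
bool verify_step2(uint64_t limit) {
    for (uint64_t n = 1; n < limit; ++n)
        if (step2(n) != step(step(n)))
            return false;
    return true;
}
```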

    [EDIT:]

    Let's translate this into assembly!

    MOV RCX, 1000000;
    
    
    
    DEC RCX;
    AND RCX, -2;
    XOR RAX, RAX;
    MOV RBX, RAX;
    
    @main:
      XOR RSI, RSI;
      LEA RDI, [RCX + 1];
    
      @loop:
        ADD RSI, 2;
        LEA RDX, [RDI + RDI*2 + 2];
        SHR RDX, 1;
        SHRD RDI, RDI, 2;    ror rdi,2   would do the same thing
        CMOVL RDI, RDX;      Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
        CMOVS RDI, RDX;
        CMP RDI, 2;
      JA @loop;
    
      LEA RDX, [RSI + 1];
      CMOVE RSI, RDX;
    
      CMP RAX, RSI;
      CMOVB RAX, RSI;
      CMOVB RBX, RCX;
    
      SUB RCX, 2;
    JA @main;
    
    
    
    MOV RDI, RCX;
    ADD RCX, 10;
    PUSH RDI;
    PUSH RCX;
    
    @itoa:
      XOR RDX, RDX;
      DIV RCX;
      ADD RDX, '0';
      PUSH RDX;
      TEST RAX, RAX;
    JNE @itoa;
    
      PUSH RCX;
      LEA RAX, [RBX + 1];
      TEST RBX, RBX;
      MOV RBX, RDI;
    JNE @itoa;
    
    POP RCX;
    INC RDI;
    MOV RDX, RDI;
    
    @outp:
      MOV RSI, RSP;
      MOV RAX, RDI;
      SYSCALL;
      POP RAX;
      TEST RAX, RAX;
    JNE @outp;
    
    LEA RAX, [RDI + 59];
    DEC RDI;
    SYSCALL;
    

    Compile with:

    nasm -f elf64 file.asm
    ld -o file file.o
    

    See the C version, and an improved/bugfixed version of the asm, by Peter Cordes on Godbolt. (Editor's note: sorry for putting my stuff in your answer, but my answer hit the 30k char limit from the Godbolt links + text!)

    • There is no integer "Q" such that "12 = 3Q + 1". Your first point is not right. (2 upvotes)
    • Here are the first 66 record-setters (A006877 on OEIS); I've marked the even ones in **bold**: **2,**3,**6,**7,9,**18,**25,27,**54,**73,97,129, 171,231,313,327,649,703,871,1161,2223,2463,2919,3711,6171,10971,13255,17647,23529,26623,34239,35655,55257,77031,106239,142587,156159, 216367,230631,410011,511935,626331,837799,1117065,1501353,1723519,2298025,3064033,3542887,3732423,5649499,6649279,8400511,11200681,14934241,15733191,**31466382,**36791535,63728127,**127456254,**169941673,226588897,268549803,**537099606,**670617279,**1341234558** (2 upvotes)

    Man*_*hit 5

    A C++ program is translated into assembly during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code that is more optimal and efficient than a dumb assembler's code.

    However, I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:

    1. Make sure your system is in its normal/idle state. Stop any running processes (applications) that you started or that use the CPU intensively (or poll over the network).
    2. Your data size must be larger.
    3. Your test must run for more than 5-10 seconds.
    4. Do not rely on just one sample. Perform the test N times. Collect the results and calculate the mean or median.

    • I really don't see how this answers the question. It's not a vague question about whether assembly or C++ code *might* be faster - it's a very specific question about *actual code*, which he helpfully provided in the question itself. Your answer doesn't even mention any of that code, or do any kind of comparison. Sure, your tips on how to benchmark are basically correct, but not enough to make an actual answer. (14 upvotes)
    • It's probably more than just measurement error: the hand-written asm code uses a 64-bit DIV instruction instead of a right shift. See my answer. But yes, measuring correctly is also important. (9 upvotes)
    • Bullet points are more appropriate formatting than a code block. Please stop putting your text into a code block, since it isn't code and doesn't benefit from a monospaced font. (7 upvotes)

    Ped*_*d7g 5

    From comments:

    But, this code will never stop (because of integer overflow)!?! Yves Daoust

    For many numbers it will not overflow.

    If it does overflow - for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.

    Still, this poses an interesting question: is there some overflow-cyclic seed number?

    Any simple final converging series starts with a power-of-two value (obvious enough?).

    2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (which ends only with 1), but the most optimal solution in the answers will finish, thanks to shr rax producing ZF=1.

    Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, and the next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically that's an undefined state of the algorithm, but johnfound's optimized answer will recover by exiting on ZF=1. Peter Cordes' cmp rax,1 will end in an infinite loop (QED variant 1, "on the cheap" through the undefined 0 value).

    How about some more complex number, which would create a cycle without 0? Frankly, I'm not sure; my math theory is too hazy to get any serious idea of how to deal with it seriously. But intuitively I would say the series converges to 1 for every number 0 < n, because the 3n+1 formula will sooner or later turn every non-2 prime factor of the original number (or of an intermediate) into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.

    So I just put a few numbers into a spreadsheet and took a look at numbers truncated to 8 bits.

    There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing via 85).

    But there's no value creating a cyclic overflow seed.

    Funnily enough, I did a check of the first numbers to suffer from 8-bit truncation, and 27 is already affected! It does reach the value 9232 in the proper non-truncated series (the first truncated value is 322, in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself). The maximum number of steps to converge to 1 is about 128 (+-2, not sure whether "1" is to be counted, etc...).

    Interestingly enough (for me), the number 9232 is the maximum for many other source numbers too. What's so special about it? :-O 9232 = 0x2410 ... hmm.. no idea.

    Unfortunately I can't get any deep grasp of this series - why does it converge, and what are the implications of truncating it to k bits - but with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop, with a particular input value ending as 0 after truncation.

    But 27 being affected by truncation in the 8-bit case is a kind of alert: it looks like if you count the number of steps to reach value 1, you will get wrong results for most numbers from the total k-bit set of integers. For 8-bit integers, 146 of the 256 numbers have their series affected by truncation (some of them may still hit the correct number of steps by accident; I was too lazy to check).
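The spreadsheet experiment is easy to reproduce in code (my sketch): run the sequence in a uint8_t to find the seeds that hit 0, and check 27's true peak in full-width arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Follow the sequence in 8-bit wrap-around arithmetic; report whether it
// hits 0 before reaching 1 (the iteration bound is only a safety net).
bool truncated8_hits_zero(uint8_t n) {
    for (int i = 0; i < 1000 && n != 1; ++i) {
        n = (n & 1) ? (uint8_t)(3 * n + 1) : (uint8_t)(n / 2);
        if (n == 0) return true;
    }
    return false;
}

// Peak value of the proper, non-truncated sequence.
uint64_t collatz_peak(uint64_t n) {
    uint64_t peak = n;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n / 2;
        if (n > peak) peak = n;
    }
    return peak;
}
```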

    • I should have checked the original problem description earlier: *"Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1."* ... well, no wonder I couldn't get a grasp of it with my limited hazy math knowledge... :D From my experiments I can assure you it does converge for every 2-255 number, either without truncation (to 1), or with 8-bit truncation (to either the expected 1, or to 0 for three numbers). (2 upvotes)
    • Upvoted for the analysis of what happens on overflow. The CMP-based loop could use `cmp rax,1 / jna` (i.e. `do{}while(n>1)`) to also terminate on zero. I thought about making an instrumented version of the loop to record the maximum `n` seen, to give an idea of how close we get to overflow. (2 upvotes)

    Dam*_*mon 5

    You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:

    test rax, 1
    jpe even
    

    ... has a 50% chance of mispredicting the branch, and that will come out expensive.

    The compiler almost certainly does both computations (which costs negligibly more since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.


    Ema*_*olm 5

    For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.

    Example Python implementation:

    import sys
    
    inner_loop = 0
    
    def collatz_sequence(N, cache):
        global inner_loop
    
        l = [ ]
        stop = False
        n = N
    
        tails = [ ]
    
        while not stop:
            inner_loop += 1
            tmp = n
            l.append(n)
            if n <= 1:
                stop = True  
            elif n in cache:
                stop = True
            elif n % 2:
                n = 3*n + 1
            else:
                n = n // 2
            tails.append((tmp, len(l)))
    
        for key, offset in tails:
            if not key in cache:
                cache[key] = l[offset:]
    
        return l
    
    def gen_sequence(l, cache):
        for elem in l:
            yield elem
            if elem in cache:
                yield from gen_sequence(cache[elem], cache)
                return
    
    if __name__ == "__main__":
        le_cache = {}
    
        for n in range(1, 4711, 5):
            l = collatz_sequence(n, le_cache)
            print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
    
        print("inner_loop = {}".format(inner_loop))
    
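The same idea also works in C++ with a flat array instead of a dict (my sketch, along the lines suggested in the comments: precomputed lengths for all n below a cutoff, no hash-table overhead):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Memoized term count: lengths for n below cache.size() are stored
// (0 = not computed yet); larger intermediate values just recurse through.
int sequence_memo(uint64_t n, std::vector<int>& cache) {
    if (n < cache.size() && cache[n] != 0)
        return cache[n];
    uint64_t next = (n & 1) ? 3 * n + 1 : n / 2;
    int len = 1 + sequence_memo(next, cache);
    if (n < cache.size())
        cache[n] = len;
    return len;
}

// The question's search, with the cache shared across all starting values.
int collatz_max_under(uint64_t limit, uint64_t* maxi) {
    std::vector<int> cache(limit, 0);
    cache[1] = 1;               // base case: the sequence "1" has one term
    int max = 0;
    for (uint64_t i = limit - 1; i >= 1; --i) {
        int s = sequence_memo(i, cache);
        if (s > max) {
            max = s;
            *maxi = i;
        }
    }
    return max;
}
```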

    • If I understand gnasher's idea above correctly, I think tail memoization is an orthogonal optimization. So you could conceivably do both. It would be interesting to investigate how much you'd gain by adding memoization to gnasher's algorithm. (2 upvotes)
    • We can maybe make memoization cheaper by only storing the dense part of the results. Set an upper limit on N; above that, don't even check the cache. Below it, use hash(N) -> N as the hash function, so key = position in the array and doesn't need to be stored. An entry of `0` means not present yet. We can further optimize by storing only odd N in the table, so the hash function is `n >> 1`, discarding the 1. Write the step code to always end with `n >> tzcnt(n)` or something to make sure it's odd. (2 upvotes)
    • You could store precomputed results for all n < N, for some large N. So you don't need the overhead of a hash table. The data in that table _will_ be used eventually, for every starting value. If you just want to confirm that the Collatz sequence always ends in (1,4,2,1,4,2,...): this can be shown to be equivalent to proving that for n > 1 the sequence will eventually drop below the original n. And for that, caching tails won't help. (2 upvotes)

Archived:

Views:

144365 times

Last active:

7 years ago