为什么这种递归比等效迭代快得多？

Question

为什么这种递归比等效迭代快得多？

我多次被告知由于函数调用递归很慢,但在这段代码中,它似乎比迭代解决方案快得多.充其量,我通常期望编译器优化递归到迭代(看着程序集,似乎确实发生).

#include <iostream>

bool isDivisable(int x, int y)
{
    for (int i = y; i != 1; --i)
        if (x % i != 0)
            return false;
    return true;
}
bool isDivisableRec(int x, int y)
{
    if (y == 1)
        return true;
    return x % y == 0 && isDivisableRec(x, y-1);
}

int findSmallest()
{
    int x = 20;
    for (; !isDivisable(x,20); ++x);
    return x;
}

int main()
{
    std::cout << findSmallest() << std::endl;
}

Run Code Online (Sandbox Code Playgroud)

在这里汇编:https://gist.github.com/PatrickAupperle/2b56e16e9e5a6a9b251e

我很想知道这里发生了什么.我确信这是一些棘手的编译器优化,我可以惊讶地了解.

编辑:我刚刚意识到我忘了提到如果我使用递归版本,它运行大约0.25秒,迭代,大约.6.

编辑2:我正在使用-O3进行编译

$ g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4

Run Code Online (Sandbox Code Playgroud)

虽然,我不确定那些重要的事情.

编辑3:
更好的基准测试:
来源:http://gist.github.com/PatrickAupperle/ee8241ac51417437d012
输出:http://gist.github.com/PatrickAupperle/5870136a5552b83fd0f1
运行100次迭代显示非常相似的结果

编辑4:
在Roman的建议中,我在编译标志中添加了-fno-inline-functions -fno-inline-small-functions.这种效果对我来说非常奇怪.代码运行速度提高了大约15倍,但递归版本和迭代版本之间的比例仍然相似. https://gist.github.com/PatrickAupperle/3a87eb53a9f11c1f0bec

Answer 1

AnT*_*AnT 6

使用这段代码我也看到了Cygwin中GCC 4.9.3的大的时序差异(支持递归版本).我明白了

13.411 seconds for iterative
4.29101 seconds for recursive

Run Code Online (Sandbox Code Playgroud)

看看它生成的汇编代码-O3,我看到了两件事

编译器用isDivisableRec循环替换尾递归,然后展开循环:机器代码中循环的每次迭代都包含原始递归的两个级别.

_Z14isDivisableRecii:
.LFB1467:
    .seh_endprologue
    movl    %edx, %r8d
.L15:
    cmpl    $1, %r8d
    je  .L18
    movl    %ecx, %eax          ; First unrolled divisibility check
    cltd
    idivl   %r8d
    testl   %edx, %edx
    je  .L20
.L19:
    xorl    %eax, %eax
    ret
    .p2align 4,,10
.L20:
    leal    -1(%r8), %r9d
    cmpl    $1, %r9d
    jne .L21
    .p2align 4,,10
.L18:
    movl    $1, %eax
    ret
    .p2align 4,,10
.L21:
    movl    %ecx, %eax          ; Second unrolled divisibility check
    cltd
    idivl   %r9d
    testl   %edx, %edx
    jne .L19
    subl    $2, %r8d
    jmp .L15
    .seh_endproc

Run Code Online (Sandbox Code Playgroud)

编译器通过将它们提升为几个迭代来内联.由于参数of 的值是硬编码的,因为编译器设法替换迭代,... 用一些"魔法"代码直接内联到.实际调用只发生在参数值(如果它发生的话).isDivisableRecfindSmallestRecyisDivisableRec20201915findSmallestRecisDivisableRecy14

这是内联代码 findSmallestRec

    movl    $20, %ebx
    movl    $1717986919, %esi      ; Magic constants
    movl    $1808407283, %edi      ; for divisibility tests
    movl    $954437177, %ebp       ;
    movl    $2021161081, %r12d     ;
    movl    $-2004318071, %r13d    ;
    jmp .L28
    .p2align 4,,10
.L29:                              ; The main cycle
    addl    $1, %ebx
.L28:
    movl    %ebx, %eax             ; Divisibility by 20 test
    movl    %ebx, %ecx
    imull   %esi
    sarl    $31, %ecx
    sarl    $3, %edx
    subl    %ecx, %edx
    leal    (%rdx,%rdx,4), %eax
    sall    $2, %eax
    cmpl    %eax, %ebx
    jne .L29
    movl    %ebx, %eax             ; Divisibility by 19 test
    imull   %edi
    sarl    $3, %edx
    subl    %ecx, %edx
    leal    (%rdx,%rdx,8), %eax
    leal    (%rdx,%rax,2), %eax
    cmpl    %eax, %ebx
    jne .L29
    movl    %ebx, %eax             ; Divisibility by 18 test
    imull   %ebp
    sarl    $2, %edx
    subl    %ecx, %edx
    leal    (%rdx,%rdx,8), %eax
    addl    %eax, %eax
    cmpl    %eax, %ebx
    jne .L29
    movl    %ebx, %eax             ; Divisibility by 17 test
    imull   %r12d
    sarl    $3, %edx
    subl    %ecx, %edx
    movl    %edx, %eax
    sall    $4, %eax
    addl    %eax, %edx
    cmpl    %edx, %ebx
    jne .L29
    testb   $15, %bl               ; Divisibility by 16 test
    jne .L29
    movl    %ebx, %eax             ; Divisibility by 15 test
    imull   %r13d
    leal    (%rdx,%rbx), %eax
    sarl    $3, %eax
    subl    %ecx, %eax
    movl    %eax, %edx
    sall    $4, %edx
    subl    %eax, %edx
    cmpl    %edx, %ebx
    jne .L29
    movl    $14, %edx
    movl    %ebx, %ecx
    call    _Z14isDivisableRecii   ; call isDivisableRecii(x, 14)
    ...

Run Code Online (Sandbox Code Playgroud)

每次jne .L29跳转前的上述机器指令块都是可分性测试20,19... 15直接提升到findSmallestRec.显然,它们比内部isDivisableRec运行时使用的测试更有效y.正如您所看到的,16测试的可分性实现简单testb $15, %bl.因此,通过上述高度优化的代码可以及早捕获x高值的不可分割性y.

这些都不会发生- isDivisable并且findSmallest它们基本上是按字面翻译的.甚至周期也没有展开.

我相信这是第二次优化,最大限度地发挥了作用.编译器使用高度优化的方法来检查更高y值的可分性,这在编译时恰好是已知的.

如果isDivisableRec用"不可预测的"运行时值20(而不是硬编码的编译时常量20)替换第二个参数,它应该禁用此优化并使时间排成一行.我刚尝试了这个并最终结束了

12.9 seconds for iterative
13.26 seconds for recursive

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，7 月前
查看次数：	186 次
最近记录：	10 年，7 月前