Arp*_*gla 3 (tags: c++, python, performance)
def main():
    i = 2
    sum = 1
    while i < 100000:
        j = 2
        while j < i:
            if i%j == 0:
                sum += 1
                break
            j += 1
        i += 1
    print(sum)

if __name__ == "__main__":
    main()
#include <iostream>

using namespace std;

int main() {
    int sum = 1;
    for (int i = 2; i < 100000; i++) {
        for (int j = 2; j < i; j++) {
            if (i % j == 0) {
                sum++;
                break;
            }
        }
    }
    cout << sum << endl;
    return 0;
}
Run with: g++ -std=c++11 x.cpp -o x && time ./x
Time: ./x 1.36s user 0.00s system 99% cpu 1.376 total
Run with: python x.py
Time: python x.py 32.10s user 0.21s system 98% cpu 32.854 total
Can anyone explain the huge difference between the time taken by the 2 programs? And what can be done to speed up the python one?
Sha*_*ger 17
Here's a simple example of the difference:
i++ in C++ compiles down to (on x86-64 machines) a simple inc REGISTER instruction. Takes a fraction of a cycle to execute.
i += 1 in Python can be disassembled with the dis module via dis.dis('i += 1') which informs us that the bytecode involved is:
  1           0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_NAME               0 (i)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE
Technically, all the instructions that end in _NAME become _FAST in a function (we disassembled an isolated statement, so it behaved slightly differently), and the LOAD_CONST (None)/RETURN_VALUE pair won't exist for the expression in a real function (the function has to do it, but not for every expression), but close enough. In practice, the real bytecode within a function would be more like:
  1           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_FAST               0 (i)
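If you want to reproduce that function-level bytecode yourself, a minimal sketch is to disassemble a real function rather than an isolated statement (exact opcodes vary by CPython version; 3.11+, for example, replaces INPLACE_ADD with a BINARY_OP instruction):

import dis

def f(i):
    i += 1      # the statement we want to inspect

dis.dis(f)      # on CPython 3.6-3.10: LOAD_FAST / LOAD_CONST / INPLACE_ADD / STORE_FAST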
Each of those instructions requires either a run through a switch statement or a computed goto (depending on how CPython was compiled), loading the next instruction and updating code position information (it also involves repeatedly checking to ensure no other thread is asking for the GIL). LOAD_FAST and LOAD_CONST instructions involve a C array lookup and reference count adjustment (a single reference count adjustment alone is equivalent to the i++ from before, except it has to change memory, not a register, so it's slower). STORE_FAST similarly involves a C array lookup, reference count adjustment (to decrement the existing value), and often, freeing memory (if the decref removed the last reference to the value).
INPLACE_ADD has to dynamically look up and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python int to do the work (and if the numbers are big enough, this involves array-based math, which gets ugly), (usually) create a brand new Python int object, and also do more reference count adjustment.
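The "brand new Python int object" part is easy to observe, at least on CPython, which only caches small ints (roughly -5 through 256):

i = 1000
old_id = id(i)
i += 1                  # ints are immutable, so the += builds a new object
print(id(i) == old_id)  # False on CPython: i now names a freshly created int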
Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.
I'll note that the relative cost here is highest for simple math operations; the more work a single bytecode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.
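One rough way to see that effect (timings vary by machine and CPython version) is to compare doing the additions one bytecode at a time with pushing the same work into a single built-in call:

import timeit

loop_stmt = """
total = 0
for x in range(10000):
    total += x
"""
print("bytecode at a time:", timeit.timeit(loop_stmt, number=1000))
print("single sum() call: ", timeit.timeit("sum(range(10000))", number=1000))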
As for speeding it up, the main rules are:
1. Write Pythonic code: range could do the job for you (and save a lot of individual bytecode instructions). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. for loops over range rather than while loops with manual counter adjustment).
2. Vectorize your work, e.g. with numpy, so a single bytecode does a lot of work (a rough sketch appears after the Pythonic rewrite below). All that overhead for a single addition is bad; paying it for 1000 additions is pretty trivial.
3. Use Cython (with cdef declarations) to compile the hot loops with static C types.
4. Use ctypes to call existing C libraries, and/or write raw Python C extensions (when Cython can't handle what you want).

Aside from that, you just have to accept that interpreted languages with dynamic typing are always going to have overhead that a compiled, statically typed language won't have.
To address point #1, a Pythonic version of your code would look like:
def main():
    sum = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if i%j == 0:
                sum += 1
                break
    print(sum)

if __name__ == "__main__":
    main()
You could even replace the inner loop with:
sum += any(i % j == 0 for j in range(2, i))
though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using range, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.
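To address point #2, here's a rough, unbenchmarked sketch (assuming numpy is installed) that vectorizes the inner trial-division loop, so the interpreter overhead is paid once per outer iteration rather than once per candidate divisor:

import numpy as np

def main():
    non_primes = 1
    for i in range(2, 100000):
        divisors = np.arange(2, i)        # every candidate divisor at once
        if np.any(i % divisors == 0):     # one vectorized modulo + test per i
            non_primes += 1
    print(non_primes)                     # 90407, like the other versions

if __name__ == "__main__":
    main()

Whether this actually wins depends on the sizes involved; building a fresh array for every small i has its own cost, so measure before committing to it.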
For demonstration of the difference in bytecode complexity, consider a function that does nothing but run a loop with either while and a manual counter or for and range:
def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass
Disassembling each function shows:
  3           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (i)

  4           4 SETUP_LOOP              20 (to 26)
        >>    6 LOAD_FAST                1 (i)
              8 LOAD_FAST                0 (n)
             10 COMPARE_OP               0 (<)
             12 POP_JUMP_IF_FALSE       24

  5          14 LOAD_FAST                1 (i)
             16 LOAD_CONST               2 (1)
             18 INPLACE_ADD
             20 STORE_FAST               1 (i)
             22 JUMP_ABSOLUTE            6
        >>   24 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             28 RETURN_VALUE
for whileloop and:
  8           0 SETUP_LOOP              16 (to 18)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_FAST                0 (n)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                 4 (to 16)
             12 STORE_FAST               1 (i)

  9          14 JUMP_ABSOLUTE           10
        >>   16 POP_BLOCK
        >>   18 LOAD_CONST               0 (None)
             20 RETURN_VALUE
for forloop. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the while runs from the LOAD_FAST following the SETUP_LOOP to the JUMP_ABSOLUTE, encompassing nine instructions per loop; for the for, it runs from the FOR_ITER to the JUMP_ABSOLUTE, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a while loop.
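A rough way to check that on your own machine (numbers vary with CPython version and hardware) is to time the two functions directly:

import timeit

def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass

# The for/range version typically wins by a wide margin on CPython.
print("while:", timeit.timeit(lambda: whileloop(100000), number=100))
print("for:  ", timeit.timeit(lambda: forloop(100000), number=100))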
[SO]: Python vs CPP: Why is the difference in speed so huge? (@ShadowRanger's answer) explains the why very well (the rationale of what happens under the hood). Here are some of my attempts, done step by step.

Setup:

OS, tools and other info.
[cfati@cfati-5510-0:/cygdrive/e/Work/Dev/StackOverflow/q057044727]> ~/sopr.sh
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[prompt]> uname -a
CYGWIN_NT-10.0 cfati-5510-0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
[prompt]>
[prompt]> python3 -c "import sys;print(\"Python {0:s} {1:d}bit on {2:s}\".format(\" \".join(item.strip() for item in sys.version.split(\"\n\")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))"
Python 3.6.8 (default, Feb 14 2019, 22:09:48) [GCC 7.4.0] 64bit on cygwin
[prompt]>
[prompt]> g++ --version | grep g++
g++ (GCC) 7.4.0
[prompt]>
[prompt]> ls
dll00.cpp  dll01.cpp  main00.cpp  script00.py  script01.py  script02.py  script03.py  script04.py
C++ (0):

Split the code into 2 files (you'll see why a bit later).
dll00.cpp:
#include <iostream>

#if defined(_WIN32)
#  define DLL_EXPORT_API __declspec(dllexport)
#else
#  define DLL_EXPORT_API
#endif

using std::cout;
using std::endl;

DLL_EXPORT_API int func00() {
    int non_primes = 1;
    for (int i = 2; i < 100000; i++) {
        for (int j = 2; j < i; j++) {
            if (i % j == 0) {
                non_primes++;
                break;
            }
        }
    }
    cout << non_primes << endl;
    return 0;
}
main00.cpp:
#include "dll00.cpp"
int main() {
return func00();
}
Output:
[prompt]> g++ -std=c++11 main00.cpp -o main000
[prompt]>
[prompt]> time ./main000
90407

real    0m1.384s
user    0m1.359s
sys     0m0.000s
script00.py:
Your original script (with some small corrections).
#!/usr/bin/env python3

def main():
    non_primes = 1
    i = 2
    while i < 100000:
        j = 2
        while j < i:
            if i % j == 0:
                non_primes += 1
                break
            j += 1
        i += 1
    print(non_primes)

if __name__ == "__main__":
    main()
Output:
[prompt]> time python3 script00.py
90407

real    0m53.738s
user    0m53.703s
sys     0m0.031s
script01.py:
Replace the (inefficient) while loops with for loops (using range).
#!/usr/bin/env python3

def main():
    non_primes = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if i % j == 0:
                non_primes += 1
                break
    print(non_primes)

if __name__ == "__main__":
    main()
Output:
[prompt]> time python3 script01.py
90407

real    0m34.142s
user    0m34.124s
sys     0m0.000s
script02.py:
Use the Python-style test against 0 (if not i % j instead of if i % j == 0).
#!/usr/bin/env python3

def main():
    non_primes = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if not i % j:
                non_primes += 1
                break
    print(non_primes)

if __name__ == "__main__":
    main()
Output:
[prompt]> time python3 script02.py
90407

real    0m28.440s
user    0m28.406s
sys     0m0.031s
script03.py:
Specific to this case: the way the divisors are searched for is very inefficient. It iterates all the way up to the number itself, when it only needs to go up to its square root (if i = a * b with a <= b, then a <= sqrt(i)), producing lots of useless operations and deepening the performance gap between the two languages.
#!/usr/bin/env python3

from math import sqrt

def main():
    non_primes = 1
    for i in range(2, 100000):
        for j in range(2, int(sqrt(i) + 1)):
            if not i % j:
                non_primes += 1
                break
    print(non_primes)

if __name__ == "__main__":
    main()
Output:
[prompt]> time python3 script03.py
90407

real    0m0.291s
user    0m0.265s
sys     0m0.015s
As can be seen, this makes a huge difference compared with the previous version (almost 100 times faster), and it even beats the (original) C++ code.
C++ (1):

The previous step changed the algorithm itself, so the C++ variant has to be changed as well, otherwise the comparison would be unfair.
dll01.cpp:
#include <iostream>
#include <math.h>

#if defined(_WIN32)
#  define DLL_EXPORT_API __declspec(dllexport)
#else
#  define DLL_EXPORT_API
#endif

using std::cout;
using std::endl;

#if defined(__cplusplus)
extern "C" {
#endif

DLL_EXPORT_API int func00() {
    int non_primes = 1;
    for (int i = 2; i < 100000; i++) {
        for (int j = 2; j < static_cast<int>(sqrt(i) + 1); j++) {
            if (i % j == 0) {
                non_primes++;
                break;
            }
        }
    }
    cout << non_primes << endl;
    return 0;
}

#if defined(__cplusplus)
}
#endif
main00.cpp must (obviously) be modified accordingly (#include "dll01.cpp").
Output:
[prompt]> g++ -std=c++11 main00.cpp -o main001
[prompt]>
[prompt]> time ./main001
90407

real    0m0.279s
user    0m0.250s
sys     0m0.030s
Calling the C++ code (through a C interface) from Python, via [Python 3.Docs]: ctypes - A foreign function library for Python:

It uses the C++ code from the previous step.
script04.py:
#!/usr/bin/env python3

import ctypes

def main():
    dll = ctypes.CDLL("./dll01.so")
    func = dll.func00
    func.argtypes = []
    func.restype = ctypes.c_int
    func()

if __name__ == "__main__":
    main()
Output:
[prompt]> g++ -std=c++11 -fPIC -shared dll01.cpp -o dll01.so
[prompt]>
[prompt]> time python3 script04.py
90407

real    0m0.327s
user    0m0.281s
sys     0m0.031s
Conclusions (drawn from the examples above):

I ran each step 3 times and pasted the middle result here; a test with meaningful results, however, should run each version thousands of times and compute an average. Also, the fact that I'm using Cygwin might interfere with the results.

- Writing Pythonic code improved the performance almost 2x (#4., #5.)
- Writing efficient algorithms shrank the difference between the two languages to almost 0 (#6. vs. #7.), and the (pure) Python code even seems to run faster than #8.
- Still, don't let these facts fool you. As proof, if the number of operations grows (and not necessarily because of inefficiency), C++ becomes much faster. You can check this by applying step #8. to dll00.cpp.
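A rough sketch of that check (it assumes dll00.cpp gets the same extern "C" wrapper as dll01.cpp, since ctypes looks symbols up by their unmangled names, and that it is built with g++ -std=c++11 -fPIC -shared dll00.cpp -o dll00.so):

#!/usr/bin/env python3

import ctypes
import time

def main():
    dll = ctypes.CDLL("./dll00.so")    # the unoptimized trial-division variant
    func = dll.func00
    func.argtypes = []
    func.restype = ctypes.c_int
    start = time.time()
    func()
    print("elapsed: {:.3f}s".format(time.time() - start))

if __name__ == "__main__":
    main()

On the timings above, that call should land close to the ~1.4s of the original C++ build, still far below the pure Python while-loop version.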