wim*_*wim 7 python performance checksum md5 hashlib
Discovered this undocumented _md5 module when getting frustrated with the slow stdlib hashlib.md5 implementation.
On a MacBook:
>>> timeit hashlib.md5(b"hello world")
597 ns ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"hello world")
224 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/_md5.cpython-37m-darwin.so'>
On a Windows box:
>>> timeit hashlib.md5(b"stonk overflow")
328 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"stonk overflow")
110 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' (built-in)>
On a Linux box:
>>> timeit hashlib.md5(b"https://adventofcode.com/2016/day/5")
259 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"https://adventofcode.com/2016/day/5")
102 ns ± 0.0576 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/lib/python3.8/lib-dynload/_md5.cpython-38-x86_64-linux-gnu.so'>
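For hashing short messages, it's way faster. For long messages, the performance is similar.
Why is it hidden away in an underscore extension module, and why isn't this faster implementation used by default in hashlib? What is the _md5 module, and why doesn't it have a public API?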
Until Python 2.5, hashes and digests were implemented in their own modules (e.g. [Python 2.Docs]: md5 - MD5 message digest algorithm).
Starting with v2.5, [Python 2.6.Docs]: hashlib - Secure hashes and message digests was added. Its purpose was to:
1. Offer a unified access method to the hashes / digests (via their name)
2. Switch (by default) to an external cryptography provider (it seems the logical step to delegate to some entity specialized in that field, as maintaining all those algorithms could be overkill). At that time OpenSSL was the best choice: mature enough, known and compatible (there were a bunch of similar Java providers, but those were pretty useless)
As a side effect of #2., the Python implementations were hidden from the public API (they were renamed: _md5, _sha1, _sha256, _sha512, and the later ones were added: _blake2, _sha3), as redundancy often creates confusion.
But another side effect was the _hashlib.so dependency on OpenSSL's libcrypto*.so (this is Nix (at least Linux) specific; on Win, a static libeay32.lib was linked in _hashlib.pyd, and also _ssl.pyd (which I consider lame), until v3.7+, where the OpenSSL .dlls are part of the Python installation).
Probably on 90%+ of the machines things were smooth, as OpenSSL was / is installed by default, but for those where it isn't, many things might break, because for example hashlib is imported by many modules (one such example is random, which itself gets imported by lots of others), so trivial pieces of code that are not related at all to cryptography (at least not at first sight) will stop working. That's why the old implementations are kept (but again, they are only fallbacks, as the OpenSSL versions are / should be better maintained).
[cfati@cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q059955854]> ~/sopr.sh
### Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ###

[064bit-prompt]> python3 -c "import sys, hashlib as hl, _md5, ssl;print(\"{0:}\n{1:}\n{2:}\n{3:}\".format(sys.version, _md5, hl._hashlib, ssl.OPENSSL_VERSION))"
3.5.2 (default, Oct 8 2019, 13:06:37)
[GCC 5.4.0 20160609]
<module '_md5' (built-in)>
<module '_hashlib' from '/usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so'>
OpenSSL 1.0.2g 1 Mar 2016
[064bit-prompt]>
[064bit-prompt]> ldd /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
 linux-vdso.so.1 => (0x00007fffa7d0b000)
 libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f50d9e4d000)
 libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f50d9a83000)
 libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f50d963e000)
 /lib64/ld-linux-x86-64.so.2 (0x00007f50da271000)
 libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50d943a000)
[064bit-prompt]>
[064bit-prompt]> openssl version -a
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options: bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
[064bit-prompt]>
[064bit-prompt]> python3 -c "import _md5, hashlib as hl;print(_md5.md5(b\"A\").hexdigest(), hl.md5(b\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29 7fc56270e7a70fa81a5935b72eacbe29
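The fallback arrangement described above can also be checked from inside Python. The following is only a minimal sketch: md5_backend is a hypothetical helper name, and it relies on the observation (visible in the benchmark output of this answer) that the OpenSSL-backed constructor is named openssl_md5 while the built-in one is named md5. Since _md5 is a private module, the import is guarded in case a build does not ship it.

```python
import hashlib

try:
    import _md5  # private CPython module; some builds may not ship it
except ImportError:
    _md5 = None


def md5_backend():
    """Hypothetical helper: label what backs hashlib.md5 ('openssl' or 'builtin')."""
    # Assumption: OpenSSL-backed constructors are named like 'openssl_md5',
    # while the built-in fallback is plainly named 'md5'.
    name = getattr(hashlib.md5, "__name__", "")
    return "openssl" if name.startswith("openssl_") else "builtin"


data = b"A"
reference = hashlib.md5(data).hexdigest()
print("hashlib.md5 is backed by:", md5_backend())

if _md5 is not None:
    # Both implementations must agree on the digest.
    assert _md5.md5(data).hexdigest() == reference
```

On a machine without OpenSSL the label would be expected to read builtin, which is the fallback case the paragraph above describes.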
According to [Python 3.Docs]: hashlib.algorithms_guaranteed:
A set containing the names of the hash algorithms guaranteed to be supported by this module on all platforms. Note that ‘md5’ is in this list despite some upstream vendors offering an odd “FIPS compliant” Python build that excludes it.
Below is an example of a custom Python 2.7 installation (that I built quite a while ago; it's worth mentioning that it dynamically links to the OpenSSL .dlls):
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q059955854]> sopr.bat
### Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ###

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import sys, ssl;print(\"{0:}\n{1:}\".format(sys.version, ssl.OPENSSL_VERSION))"
2.7.10 (default, Mar 8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]
OpenSSL 1.0.2j-fips 26 Sep 2016

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import ssl;ssl.FIPS_mode_set(True);import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
Traceback (most recent call last):
 File "<string>", line 1, in <module>
ValueError: error:060A80A3:digital envelope routines:FIPS_DIGESTINIT:disabled for fips
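On current Python 3 versions, code that only needs MD5 as a checksum can guard against the failure shown above. This is a rough sketch, not a guaranteed fix: md5_hexdigest is a hypothetical helper name, and it assumes Python 3.9+, where the hashlib constructors accept the keyword-only usedforsecurity argument. Whether the retry succeeds still depends on how the interpreter and its crypto provider were built.

```python
import hashlib


def md5_hexdigest(data: bytes) -> str:
    """Hypothetical helper: MD5 hex digest that tolerates FIPS-restricted builds."""
    try:
        return hashlib.md5(data).hexdigest()
    except ValueError:
        # Raised when the crypto provider disables MD5 (as in the traceback above).
        # usedforsecurity=False (Python 3.9+) declares a non-cryptographic use,
        # e.g. a checksum; on some builds this call may still fail.
        return hashlib.md5(data, usedforsecurity=False).hexdigest()


# 'md5' is listed as guaranteed even though a restricted build may reject it.
print("md5" in hashlib.algorithms_guaranteed)
print(md5_hexdigest(b"A"))
```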
As for the speed question I can only speculate:
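The Python implementation was (obviously) written specifically for Python, meaning it is "more optimized" (yes, this is grammatically incorrect) for Python than a generic version, and it also resides in python*.so (or the python executable itself)
The OpenSSL implementation resides in libcrypto*.so, and it is accessed by a wrapper, _hashlib.so, which does the back and forth conversions between Python types (PyObject*) and OpenSSL ones (EVP_MD_CTX*)
Considering the above, it would make sense that the former is (slightly) faster (at least for small messages, where the overhead (function call and other Python underlying operations) takes a significant percentage of the total time compared to the hashing itself). There are also other factors to be considered (e.g. whether the OpenSSL assembler speedups were used).
Below are some benchmarks of my own.
code00.py: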
#!/usr/bin/env python

import sys
import timeit
from hashlib import md5 as md5_openssl
from _md5 import md5 as md5_builtin


MD5S = (
    md5_openssl,
    md5_builtin,
)


def main(*argv):
    base_text = b"A"
    number = 1000000
    print("timeit attempts number: {:d}".format(number))
    #x = []
    #y = {}
    for count in range(0, 16):
        factor = 2 ** count
        text = base_text * factor
        globals_dict = {"text": text}
        #x.append(factor)
        print("\nUsing a {:8d} (2 ** {:2d}) bytes message".format(len(text), count))
        for func in MD5S:
            globals_dict["md5"] = func
            t = timeit.timeit(stmt="md5(text)", globals=globals_dict, number=number)
            print(" {:12s} took: {:11.6f} seconds".format(func.__name__, t))
            #y.setdefault(func.__name__, []).append(t)
    #print(x, y)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.\n")
    sys.exit(rc)
Output:
Win 10 pc064 (running on a Dell Precision 5510 laptop):
[prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.07.06_test0\Scripts\python.exe" ./code00.py
Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] 64bit on win32

timeit attempts number: 1000000

Using a 1 (2 ** 0) bytes message
 openssl_md5 took: 0.449134 seconds
 md5 took: 0.120021 seconds

Using a 2 (2 ** 1) bytes message
 openssl_md5 took: 0.460399 seconds
 md5 took: 0.118555 seconds

Using a 4 (2 ** 2) bytes message
 openssl_md5 took: 0.451850 seconds
 md5 took: 0.121166 seconds

Using a 8 (2 ** 3) bytes message
 openssl_md5 took: 0.438398 seconds
 md5 took: 0.118127 seconds

Using a 16 (2 ** 4) bytes message
 openssl_md5 took: 0.454653 seconds
 md5 took: 0.122818 seconds

Using a 32 (2 ** 5) bytes message
 openssl_md5 took: 0.450776 seconds
 md5 took: 0.118594 seconds

Using a 64 (2 ** 6) bytes message
 openssl_md5 took: 0.555761 seconds
 md5 took: 0.278812 seconds

Using a 128 (2 ** 7) bytes message
 openssl_md5 took: 0.681296 seconds
 md5 took: 0.455921 seconds

Using a 256 (2 ** 8) bytes message
 openssl_md5 took: 0.895952 seconds
 md5 took: 0.807457 seconds

Using a 512 (2 ** 9) bytes message
 openssl_md5 took: 1.401584 seconds
 md5 took: 1.499279 seconds

Using a 1024 (2 ** 10) bytes message
 openssl_md5 took: 2.360966 seconds
 md5 took: 2.878650 seconds

Using a 2048 (2 ** 11) bytes message
 openssl_md5 took: 4.383245 seconds
 md5 took: 5.655477 seconds

Using a 4096 (2 ** 12) bytes message
 openssl_md5 took: 8.264774 seconds
 md5 took: 10.920909 seconds

Using a 8192 (2 ** 13) bytes message
 openssl_md5 took: 15.521947 seconds
 md5 took: 21.895179 seconds

Using a 16384 (2 ** 14) bytes message
 openssl_md5 took: 29.947287 seconds
 md5 took: 43.198639 seconds

Using a 32768 (2 ** 15) bytes message
 openssl_md5 took: 59.123447 seconds
 md5 took: 86.453821 seconds

Done.
Ubuntu 16 pc064 (VM running in VirtualBox on the above machine):
[064bit-prompt]> python3 ./code00.py
Python 3.5.2 (default, Oct 8 2019, 13:06:37) [GCC 5.4.0 20160609] 64bit on linux

timeit attempts number: 1000000

Using a 1 (2 ** 0) bytes message
 openssl_md5 took: 0.246166 seconds
 md5 took: 0.130589 seconds

Using a 2 (2 ** 1) bytes message
 openssl_md5 took: 0.251019 seconds
 md5 took: 0.127750 seconds

Using a 4 (2 ** 2) bytes message
 openssl_md5 took: 0.257018 seconds
 md5 took: 0.123116 seconds

Using a 8 (2 ** 3) bytes message
 openssl_md5 took: 0.245399 seconds
 md5 took: 0.128267 seconds

Using a 16 (2 ** 4) bytes message
 openssl_md5 took: 0.251832 seconds
 md5 took: 0.136373 seconds

Using a 32 (2 ** 5) bytes message
 openssl_md5 took: 0.248410 seconds
 md5 took: 0.140708 seconds

Using a 64 (2 ** 6) bytes message
 openssl_md5 took: 0.361016 seconds
 md5 took: 0.267021 seconds

Using a 128 (2 ** 7) bytes message
 openssl_md5 took: 0.478735 seconds
 md5 took: 0.413986 seconds

Using a 256 (2 ** 8) bytes message
 openssl_md5 took: 0.707602 seconds
 md5 took: 0.695042 seconds

Using a 512 (2 ** 9) bytes message
 openssl_md5 took: 1.216832 seconds
 md5 took: 1.268570 seconds

Using a 1024 (2 ** 10) bytes message
 openssl_md5 took: 2.122014 seconds
 md5 took: 2.429623 seconds

Using a 2048 (2 ** 11) bytes message
 openssl_md5 took: 4.158188 seconds
 md5 took: 4.847686 seconds

Using a 4096 (2 ** 12) bytes message
 openssl_md5 took: 7.839173 seconds
 md5 took: 9.242224 seconds

Using a 8192 (2 ** 13) bytes message
 openssl_md5 took: 15.282232 seconds
 md5 took: 18.368874 seconds

Using a 16384 (2 ** 14) bytes message
 openssl_md5 took: 30.681912 seconds
 md5 took: 36.755073 seconds

Using a 32768 (2 ** 15) bytes message
 openssl_md5 took: 60.230543 seconds
 md5 took: 73.237356 seconds

Done.
Ubuntu 22 pc064 (dual booting on the same machine):
[064bit prompt]> python ./code00.py
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] 064bit on linux

timeit attempts number: 1000000

Using a 1 (2 ** 0) bytes message
 openssl_md5 took: 0.258825 seconds
 md5 took: 0.092418 seconds

Using a 2 (2 ** 1) bytes message
 openssl_md5 took: 0.265123 seconds
 md5 took: 0.095969 seconds

Using a 4 (2 ** 2) bytes message
 openssl_md5 took: 0.273572 seconds
 md5 took: 0.098485 seconds

Using a 8 (2 ** 3) bytes message
 openssl_md5 took: 0.267524 seconds
 md5 took: 0.102606 seconds

Using a 16 (2 ** 4) bytes message
 openssl_md5 took: 0.295750 seconds
 md5 took: 0.102688 seconds

Using a 32 (2 ** 5) bytes message
 openssl_md5 took: 0.266704 seconds
 md5 took: 0.095375 seconds

Using a 64 (2 ** 6) bytes message
 openssl_md5 took: 0.350251 seconds
 md5 took: 0.209725 seconds

Using a 128 (2 ** 7) bytes message
 openssl_md5 took: 0.559193 seconds
 md5 took: 0.362671 seconds

Using a 256 (2 ** 8) bytes message
 openssl_md5 took: 0.685720 seconds
 md5 took: 0.589242 seconds

Using a 512 (2 ** 9) bytes message
 openssl_md5 took: 1.100991 seconds
 md5 took: 1.081601 seconds

Using a 1024 (2 ** 10) bytes message
 openssl_md5 took: 2.069975 seconds
 md5 took: 2.176450 seconds

Using a 2048 (2 ** 11) bytes message
 openssl_md5 took: 3.742486 seconds
 md5 took: 4.197531 seconds

Using a 4096 (2 ** 12) bytes message
 openssl_md5 took: 7.186287 seconds
 md5 took: 8.270421 seconds

Using a 8192 (2 ** 13) bytes message
 openssl_md5 took: 13.889762 seconds
 md5 took: 16.225811 seconds

Using a 16384 (2 ** 14) bytes message
 openssl_md5 took: 27.422105 seconds
 md5 took: 32.898019 seconds

Using a 32768 (2 ** 15) bytes message
 openssl_md5 took: 54.010482 seconds
 md5 took: 64.579159 seconds

Done.
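The results seem to be quite different from yours. In my case:
Starting somewhere in the [~512B .. ~1KiB] message size range, the OpenSSL implementation seems to perform better than the builtin one
I know that there are too few results to claim a pattern, but both implementations seem to scale linearly (in terms of time) with the message size (the builtin slope seems to be a bit steeper, though - meaning that it would perform worse in the long run)
As a conclusion, if all your messages are small, and the builtin implementation works best for you, then use it.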
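To put a rough number on where the two implementations cross, one can interpolate linearly between adjacent measured sizes. The sketch below is only an illustration: estimate_crossover is a hypothetical helper, the rows are copied from the Ubuntu 22 run above, and linear interpolation between two power-of-two sizes is an assumption, not something the measurements establish.

```python
# (size in bytes, openssl_md5 seconds, md5 seconds), copied from the
# Ubuntu 22 run above (1,000,000 timeit iterations per row).
ROWS = [
    (256, 0.685720, 0.589242),
    (512, 1.100991, 1.081601),
    (1024, 2.069975, 2.176450),
    (2048, 3.742486, 4.197531),
]


def estimate_crossover(rows):
    """Return the interpolated size where the builtin stops being faster, or None."""
    for (s0, o0, b0), (s1, o1, b1) in zip(rows, rows[1:]):
        d0, d1 = b0 - o0, b1 - o1  # negative while the builtin is faster
        if d0 < 0 <= d1:
            # Linear interpolation of the time difference between the two sizes.
            return s0 + (s1 - s0) * (-d0) / (d1 - d0)
    return None


crossover = estimate_crossover(ROWS)
print("estimated crossover: ~{:.0f} bytes".format(crossover))
```

For this run the estimate falls between 512 and 1024 bytes. The same procedure applied to the Win 10 and Ubuntu 16 rows would place it between 256 and 512 bytes, so the exact figure clearly depends on the machine and the Python / OpenSSL versions.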
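Graphical representation (I had to reduce the timeit iteration number by an order of magnitude, as it would take way too long for large messages):
And zoomed in on the area where the 2 graphs intersect: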