Python字符串比较不会短路?

Gré*_*vet -1 python timing

通常的说法是,在检查密码或哈希等内容时,字符串比较必须在恒定时间内完成,因此建议避免a == b. 但是,我运行以下脚本,结果不支持a==b第一个不同字符上短路的假设。

from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]

    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))

    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()

Run Code Online (Sandbox Code Playgroud)

这是运行的结果,重新运行脚本给出的结果略有不同,但没有令人满意的结果。

# ran with cpython 3.8.3

6   78.051700
1   78.203200
15  78.222700
14  78.384800
11  78.396300
12  78.441800
9   78.476900
13  78.519000
8   78.586200
3   78.631500
---
0   80.691100
1   78.203200
Run Code Online (Sandbox Code Playgroud)

我本以为最快的比较是第一个不同字符位于字符串开头的位置,但这不是我得到的。知道发生了什么事吗???

Jul*_*ard 5

这是有区别的,只是你在这么小的琴弦上看不到它。这是一个适用于您的代码的小补丁,因此我使用更长的字符串,并且通过将 A 放在原始字符串中从头到尾均匀间隔的位置来进行 10 次检查,我的意思是,如下所示:

A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
Run Code Online (Sandbox Code Playgroud)
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]

     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
Run Code Online (Sandbox Code Playgroud)

你会得到:

0   122.621000
1   213.465700
2   380.214100
3   460.422000
5   694.278700
4   722.010000
7   894.630300
6   1020.722100
9   1149.473000
8   1341.754500
---
0   122.621000
1   213.465700
Run Code Online (Sandbox Code Playgroud)

请注意,在您的示例中,仅包含2**8字符,这已经很明显了,请应用此补丁:

@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
 
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
Run Code Online (Sandbox Code Playgroud)

仅保留两种极端情况(第一个字母更改与最后一个字母更改),您将得到:

$ python3 cmp.py
0   124.131800
1   135.566000
Run Code Online (Sandbox Code Playgroud)

数字可能会有所不同,但大多数时候 test0比 test 快一点1

为了更精确地隔离哪个字符被修改,只要 memcmp 逐个字符地进行修改,只要它不使用整数比较,通常在最后一个字符(如果它们未对齐)或非常短的字符串上进行比较,这是可能的8 个字符的字符串,正如我在这里演示的:

A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
Run Code Online (Sandbox Code Playgroud)

这给了我:

1   221.000000
2   222.000000
3   223.000000
4   223.000000
5   223.000000
6   223.000000
7   223.000000
0   241.000000
Run Code Online (Sandbox Code Playgroud)

差异非常小,Python 和 perf_counter_ns 可能不再是合适的工具。