在Python 3中生成随机长度的类似随机字符串的最快方法

Kev*_*n91 23 python string random python-3.x

我知道如何创建随机字符串,如:

''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))
Run Code Online (Sandbox Code Playgroud)

但是,应该没有重复,所以我目前只是检查密钥是否已经存在于列表中,如下面的代码所示:

import secrets
import string
import numpy as np


amount_of_keys = 40000

keys = []

for i in range(0,amount_of_keys):
    N = np.random.randint(12,20)
    n_key = ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))
    if not n_key in keys:
        keys.append(n_key)
Run Code Online (Sandbox Code Playgroud)

对于少量的键来说这是可以的40000,但是问题不能很好地扩展,因为有更多的键.所以我想知道是否有更快的方法来获得更多键的结果,比如999999

Mar*_*ers 46

基本改进,集合和本地名称

使用集合而不是列表,测试唯一性要快得多; 集合成员资格测试需要与设置大小无关的恒定时间,而列表需要O(N)线性时间.使用set comprehension一次生成一系列键,以避免必须查找并set.add()循环调用该方法; 适当随机,较大的键无论如何都有很小的机会产生重复.

因为这是在紧密循环中完成的,所以尽可能地优化所有名称查找是值得的:

import secrets
import numpy as np
from functools import partial

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(secrets.choice, string.ascii_uppercase + string.digits)
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys
Run Code Online (Sandbox Code Playgroud)

_randint关键字参数绑定np.random.randint名称中的功能的地方,这是更快的参考比全局变量,属性查找涉及时尤其如此.

pickchar()查找上的模块或更多当地人属性偏避免; 它是一个具有所有引用的单个可调用对象,因此执行速度更快,尤其是在循环中完成时.

while循环迭代保持只有有生产重复.如果没有重复,我们在单个集合理解中生成足够的密钥以填充其余部分.

为第一次改进计时

对于100件商品,差异并不大:

>>> timeit('p(100)', 'from __main__ import produce_amount_keys_list as p', number=1000)
8.720592894009314
>>> timeit('p(100)', 'from __main__ import produce_amount_keys_set as p', number=1000)
7.680242831003852
Run Code Online (Sandbox Code Playgroud)

但是当你开始扩展它时,你会注意到对列表的O(N)成员资格测试成本确实拖累了你的版本:

>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_list as p', number=10)
15.46253142200294
>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_set as p', number=10)
8.047800761007238
Run Code Online (Sandbox Code Playgroud)

我的版本几乎是10k项目的两倍; 40k项目可在约32秒内运行10次:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_list as p', number=10)
138.84072386901244
>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_set as p', number=10)
32.40720253501786
Run Code Online (Sandbox Code Playgroud)

列表版本花了2分多钟,超过十倍.

Numpy的random.choice函数,不是加密强大的

你可以通过放弃secrets模块并使用它np.random.choice()来更快地做到这一点; 但是,这不会产生加密级别随机性,但选择随机字符的速度是原来的两倍:

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys
Run Code Online (Sandbox Code Playgroud)

这产生了巨大的差异,现在只需16秒即可生成10倍40k的密钥:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_npchoice as p', number=10)
15.632006907981122
Run Code Online (Sandbox Code Playgroud)

使用itertools模块和生成器进一步调整

我们还可以从模块Recipes部分中获取unique_everseen()功能,使其具有唯一性,然后使用无限生成器和函数将结果限制为我们想要的数字:itertoolsitertools.islice()

# additional imports
from itertools import islice, repeat

# assumption: unique_everseen defined or imported

def produce_amount_keys(amount_of_keys):
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    def gen_keys(_range=range, _randint=np.random.randint):
        while True:
            yield ''.join([pickchar() for _ in _range(_randint(12, 20))])
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
Run Code Online (Sandbox Code Playgroud)

这稍微快一些,但只是稍微如此:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_itertools as p', number=10)
14.698191125993617
Run Code Online (Sandbox Code Playgroud)

os.urandom()字节和生成字符串的不同方法

接下来,我们可以继续讨论Adam Barnes关于使用UUID4(基本上只是一个包装器os.urandom())和Base64 的想法.但是通过大小写折叠Base64并用随机选择的字符替换2个字符,他的方法严重限制了这些字符串中的熵(你不会产生可能的全部唯一值,一个只有(256 ** 15) / (36 ** 20)每个== 1 的20个字符的字符串)99437位熵!).

Base64编码既使用大写和小写字符和数字,但也增加了-/字符(或+_该URL安全的变体).对于大写字母和数字,您必须将输出大写并将这些额外的两个字符映射到其他随机字符,这个过程会从提供的随机数据中丢弃大量的熵os.urandom().您也可以使用Base32编码,而不是使用Base64,它使用大写字母和数字2到8,因此产生的字符串有32**n种可能性,而不是36**n.但是,这可以加快上述尝试的速度:

import os
import base64
import math

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=base64.b32encode, _randint=np.random.randint):
        # (count / math.log(256, 32)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 32)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[:count].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
Run Code Online (Sandbox Code Playgroud)

真的很快:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b32 as p', number=10)
4.572628145979252
Run Code Online (Sandbox Code Playgroud)

40k键,10次,仅需4秒钟.所以大约快75倍; 使用os.urandom()作为来源的速度是不可否认的.

这是加密强大的 ; os.urandom()生成用于加密使用的字节.在另一方面,我们减小(超过90%产生的可能串的数量((36 ** 20) - (32 ** 20)) / (36 ** 20) * 100是90.5),我们不再使用0,1,89在输出数字.

所以也许我们应该使用这个urandom()技巧来产生适当的Base36编码; 我们必须制作我们自己的b36encode()功能:

import string
import math

def b36encode(b, 
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    """Encode a bytes value to Base36 (uppercase ASCII and digits)

    This isn't too friendly on memory because we convert the whole bytes
    object to an int, but for smaller inputs this should be fine.
    """
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))
Run Code Online (Sandbox Code Playgroud)

并使用:

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint):
        # (count / math.log(256, 36)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 36)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
Run Code Online (Sandbox Code Playgroud)

这相当快,最重要的是产生36个大写字母和数字的全部范围:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b36 as p', number=10)
8.099918447987875
Run Code Online (Sandbox Code Playgroud)

当然,base32版本几乎是这个版本的两倍(由于使用表格的高效Python实现),但使用自定义Base36编码器的速度仍然是非加密安全numpy.random.choice()版本的两倍.

但是,再次使用os.urandom() 产生偏差 ; 我们必须产生比12到19个base36'数字'所需的更多比特的熵.例如,对于17位数字,我们不能使用字节产生36**17个不同的值,只有最接近的256**11字节,大约是1.08倍,因此我们最终会产生偏差朝向A,B以及在较小程度上C(感谢Stefan Pochmann指出这一点).

在下面选取一个整数(36 ** length)并将整数映射到base36

所以我们需要找到一种安全的随机方法,它可以为我们提供在0(包含)和36 ** (desired length)(不包含)之间均匀分布的值.然后我们可以将数字直接映射到所需的字符串.

首先,将整数映射到字符串; 以下内容已经过调整,以便最快地生成输出字符串:

def b36number(n, length, _range=range, _c=string.ascii_uppercase + string.digits):
    """Convert an integer to Base36 (uppercase ASCII and digits)"""
    chars = [_c[0]] * length
    while n:
        length -= 1
        chars[length] = _c[n % 36]
        n //= 36
    return ''.join(chars)
Run Code Online (Sandbox Code Playgroud)

接下来,我们需要一种快速且加密的安全方法来选择范围内的数字.你仍然可以使用os.urandom()它,但是你必须将字节掩盖到最大位数,然后循环直到你的实际值低于限制.这实际上已经由secrets.randbelow()函数实现了.在Python版本<3.6中,您可以使用random.SystemRandom().randrange(),它使用完全相同的方法和一些额外的包装来支持大于0的下限和步长.

使用secrets.randbelow()该功能变为:

import secrets

def produce_amount_keys(amount_of_keys):
    def gen_keys(_below=secrets.randbelow, _encode=b36number, _randint=np.random.randint):
        limit = [None] * 12 + [36 ** l for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_below(limit[count]), count)
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
Run Code Online (Sandbox Code Playgroud)

然后这非常接近(可能有偏见的)base64解决方案:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_below as p', number=10)
5.135716405988205
Run Code Online (Sandbox Code Playgroud)

这几乎与Base32方法一样快,但产生全系列键!

  • @Michael:参见[在Python中没有\ [\]的列表理解](// stackoverflow.com/a/9061024) (2认同)

Ada*_*nes 8

所以这是速度赛吗?

在Martijn Pieters的工作基础上,我有一个巧妙利用另一个库来生成随机字符串的解决方案: uuid.

我的解决方案是生成一个uuid4base64对它进行编码并将其大写,以便只获取我们所追求的字符,然后将其切片为随机长度.

这适用于这种情况,因为我们之后的输出长度(12-20)比uuid4的最短base64编码短.它也非常快,因为uuid速度非常快.

我也把它变成了发电机而不是常规功能,因为它们可以更有效率.

有趣的是,使用标准库的randint功能比使用标准库更快numpy.

这是测试输出:

Timing 40k keys 10 times with produce_amount_keys
20.899942063027993
Timing 40k keys 10 times with produce_amount_keys, stdlib randint
20.85920040300698
Timing 40k keys 10 times with uuidgen
3.852462349983398
Timing 40k keys 10 times with uuidgen, stdlib randint
3.136272903997451
Run Code Online (Sandbox Code Playgroud)

这是以下代码uuidgen():

def uuidgen(count, _randint=np.random.randint):
    generated = set()

    while True:
        if len(generated) == count:
            return

        candidate = b64encode(uuid4().hex.encode()).upper()[:_randint(12, 20)]
        if candidate not in generated:
            generated.add(candidate)
            yield candidate
Run Code Online (Sandbox Code Playgroud)

这里是整个项目.(在撰写本文时提交d9925d).


感谢Martijn Pieters的反馈,我对方法进行了一些改进,增加了熵,并将其加速了大约1/6.

将所有小写字母转换为大写字母时仍然存在大量熵损失.如果这是重要的,那么它可能建议使用b32encode()来代替,其中有我们想要的字符,减0,1,8,和9.

新解决方案如下:

def urandomgen(count):
    generated = set()

    while True:
        if len(generated) == count:
            return

        desired_length = randint(12, 20)

        # # Faster than math.ceil
        # urandom_bytes = urandom(((desired_length + 1) * 3) // 4)
        #
        # candidate = b64encode(urandom_bytes, b'//').upper()
        #
        # The above is rolled into one line to cut down on execution
        # time stemming from locals() dictionary access.

        candidate = b64encode(
            urandom(((desired_length + 1) * 3) // 4),
            b'//',
        ).upper()[:desired_length]

        while b'/' in candidate:
            candidate = candidate.replace(b'/', choice(ALLOWED_CHARS), 1)

        if candidate not in generated:
            generated.add(candidate)
            yield candidate.decode()
Run Code Online (Sandbox Code Playgroud)

并测试输出:

Timing 40k keys 10 times with produce_amount_keys, stdlib randint
19.64966493297834
Timing 40k keys 10 times with uuidgen, stdlib randint
4.063803717988776
Timing 40k keys 10 times with urandomgen, stdlib randint
2.4056471119984053
Run Code Online (Sandbox Code Playgroud)

我的存储库中的新提交是5625fd.


Martijn对熵的评论让我思考.我使用的方法base64.upper()做标记,从而比数字更常见.我用一个更加二元的思想重新审视了这个问题.

我们的想法是从中获取输出os.urandom(),将其解释为6位无符号数字的长字符串,并将这些数字用作允许字符滚动数组的索引.第一个6位数字将从范围中选择一个字符A..Z0..9A..Z01,第二个6位数字将从范围中选择一个字符2..9A..Z0..9A..T,依此类推.

这有点轻微的熵,因为第一个字符的可能性稍微不大2..9,第二个字符不太可能包含U..Z0,等等,但它比以前好多了.

它比稍快uuidgen(),稍慢urandomgen(),如下所示:

Timing 40k keys 10 times with produce_amount_keys, stdlib randint
20.440480664998177
Timing 40k keys 10 times with uuidgen, stdlib randint
3.430628580001212
Timing 40k keys 10 times with urandomgen, stdlib randint
2.0875444510020316
Timing 40k keys 10 times with bytegen, stdlib randint
2.8740892770001665
Run Code Online (Sandbox Code Playgroud)

我不完全确定如何消除最后一点熵破碎; 抵消角色的起点将只是稍微移动模式,随机化偏移将是缓慢的,拖延地图仍然会有一段时间...我对想法持开放态度.

新代码如下:

from os import urandom
from random import randint
from string import ascii_uppercase, digits

# Masks for extracting the numbers we want from the maximum possible
# length of `urandom_bytes`.
bitmasks = [(0b111111 << (i * 6), i) for i in range(20)]
allowed_chars = (ascii_uppercase + digits) * 16  # 576 chars long


def bytegen(count):
    generated = set()

    while True:
        if len(generated) == count:
            return

        # Generate 9 characters from 9x6 bits
        desired_length = randint(12, 20)
        bytes_needed = (((desired_length * 6) - 1) // 8) + 1

        # Endianness doesn't matter.
        urandom_bytes = int.from_bytes(urandom(bytes_needed), 'big')

        chars = [
            allowed_chars[
                (((urandom_bytes & bitmask) >> (i * 6)) + (0b111111 * i)) % 576
            ]
            for bitmask, i in bitmasks
        ][:desired_length]

        candidate = ''.join(chars)

        if candidate not in generated:
            generated.add(candidate)
            yield candidate
Run Code Online (Sandbox Code Playgroud)

完整的代码以及有关实现的更深入的自述文件已在de0db8上完成.

我尝试了几件事来加快实施速度,在回购中可见.肯定有帮助的是字符编码,其中数字和ASCII大写字母是连续的.


Ste*_*ann 5

一个简单而快速的方法:

def b36(n, N, chars=string.ascii_uppercase + string.digits):
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys(amount_of_keys):
    keys = set()
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys
Run Code Online (Sandbox Code Playgroud)

--编辑:以下内容参考了 Martijn 答案的先前修订版。经过我们的讨论,他添加了另一个解决方案,该解决方案与我的基本相同,但进行了一些优化。不过,它们并没有多大帮助,在我的测试中,它只比我的快 3.4% 左右,所以在我看来,它们大多只是让事情变得复杂。--

与 Martijn 在他接受的答案中的最终解决方案相比,我的解决方案要简单得多,大约快 1.7 倍,并且没有偏见:

Stefan
8.246490597876106 seconds.
8 different lengths from 12 to 19
  Least common length 19 appeared 124357 times.
  Most common length 16 appeared 125424 times.
36 different characters from 0 to Z
  Least common character Q appeared 429324 times.
  Most common character Y appeared 431433 times.
36 different first characters from 0 to Z
  Least common first character C appeared 27381 times.
  Most common first character Q appeared 28139 times.
36 different last characters from 0 to Z
  Least common last character Q appeared 27301 times.
  Most common last character E appeared 28109 times.

Martijn
14.253227412021943 seconds.
8 different lengths from 12 to 19
  Least common length 13 appeared 124753 times.
  Most common length 15 appeared 125339 times.
36 different characters from 0 to Z
  Least common character 9 appeared 428176 times.
  Most common character C appeared 434029 times.
36 different first characters from 0 to Z
  Least common first character 8 appeared 25774 times.
  Most common first character A appeared 31620 times.
36 different last characters from 0 to Z
  Least common last character Y appeared 27440 times.
  Most common last character X appeared 28168 times.
Run Code Online (Sandbox Code Playgroud)

Martijn 对第一个角色有偏见,A出现的次数太多,也8很少。我运行了十次测试,他最常见的第一个字符始终是Aor B(各五次),而他最不常见的字符始终是7, 8or 9(分别是两次、三次和五次)。我还分别检查了长度,长度17特别糟糕,他最常见的第一个字符总是出现大约51500次,而他最不常见的第一个字符出现大约25400次。

有趣的旁注:我正在使用secretsMartijn 驳回的模块:-)

我的整个脚本:

import string
import secrets
import numpy as np
import os
from itertools import islice, filterfalse
import math

#------------------------------------------------------------------------------------
#   Stefan
#------------------------------------------------------------------------------------

def b36(n, N, chars=string.ascii_uppercase + string.digits):
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys_stefan(amount_of_keys):
    keys = set()
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys

#------------------------------------------------------------------------------------
#   Martijn
#------------------------------------------------------------------------------------

def b36encode(b, 
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))

def produce_amount_keys_martijn(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint, _factor=math.log(256, 36)):
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(math.ceil(count / _factor)))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

#------------------------------------------------------------------------------------
#   Needed for Martijn
#------------------------------------------------------------------------------------

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

#------------------------------------------------------------------------------------
#   Benchmark and quality check
#------------------------------------------------------------------------------------

from timeit import timeit
from collections import Counter

def check(name, func):
    print()
    print(name)

    # Get 999999 keys and report the time.
    keys = None
    def getkeys():
        nonlocal keys
        keys = func(999999)
    t = timeit(getkeys, number=1)
    print(t, 'seconds.')

    # Report statistics about lengths and characters
    def statistics(label, values):
        ctr = Counter(values)
        least = min(ctr, key=ctr.get)
        most = max(ctr, key=ctr.get)
        print(len(ctr), f'different {label}s from', min(ctr), 'to', max(ctr))
        print(f'  Least common {label}', least, 'appeared', ctr[least], 'times.')
        print(f'  Most common {label}', most, 'appeared', ctr[most], 'times.')
    statistics('length', map(len, keys))
    statistics('character', ''.join(keys))
    statistics('first character', (k[0] for k in keys))
    statistics('last character', (k[-1] for k in keys))

for _ in range(2):
    check('Stefan', produce_amount_keys_stefan)
    check('Martijn', produce_amount_keys_martijn)
Run Code Online (Sandbox Code Playgroud)