在python中替换大量字符串?

Mik*_*cic 42 python regex string performance replace

假设我有一个看起来像这样的字符串:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
Run Code Online (Sandbox Code Playgroud)

您会注意到字符串中有很多位置,其中有一个&符号,后跟一个字符(例如"&y"和"&c").我需要用字典中的适当值替换这些字符,如下所示:

dict = {"&y":"\033[0;30m",
        "&c":"\033[0;31m",
        "&b":"\033[0;32m",
        "&Y":"\033[0;33m",
        "&u":"\033[0;34m"}
Run Code Online (Sandbox Code Playgroud)

最快的方法是什么?我可以手动找到所有的&符号,然后循环通过字典来改变它们,但这似乎很慢.做一堆正则表达式替换似乎也很慢(我的实际代码中将有一个大约30-40对的字典).

任何建议表示赞赏,谢谢.

编辑:

正如在这个问题的评论中所指出的,我的字典是在运行时之前定义的,并且在应用程序生命周期的过程中永远不会改变.它是ANSI转义序列的列表,其中将包含大约40个项目.我要比较的平均字符串长度大约为500个字符,但最多可达5000个字符(但这些字符很少见).我目前也在使用Python 2.6.

编辑#2 我接受Tor Valamos的回答是正确的,因为它不仅提供了有效的解决方案(虽然它不是最好的解决方案),而是考虑了所有其他解决方案并做了大量的工作来比较所有这些.这个答案是我在StackOverflow上遇到过的最好,最有帮助的答案之一.感谢你.

Tor*_*amo 30

mydict = {"&y":"\033[0;30m",
          "&c":"\033[0;31m",
          "&b":"\033[0;32m",
          "&Y":"\033[0;33m",
          "&u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

for k, v in mydict.iteritems():
    mystr = mystr.replace(k, v)

print mystr
The ?[0;30mquick ?[0;31mbrown ?[0;32mfox ?[0;33mjumps over the ?[0;34mlazy dog
Run Code Online (Sandbox Code Playgroud)

我冒昧地比较了几个解决方案:

mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print 'Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr))

# My solution
t = time()
for x in range(rep):
    for k, v in mydict.items():
        mystr.replace(k, v)
    #print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t

# Claudiu
def multiple_replace(dict, text): 
    # Create a regular expression  from the dictionary keys
    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

t = time()
for x in range(rep):
    multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t

# Andrew Y - fixed
def repl(s):
  return mydict["&"+s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t
Run Code Online (Sandbox Code Playgroud)

Python 2.6中的结果

Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict      1.04699993134
Peter fixed & variable dict    0.218999862671
Claudio variable dict          2.48400020599
Claudio fixed dict             0.0940001010895
Andrew Y variable dict         0.0309998989105
Andrew Y fixed dict            0.0310001373291
Run Code Online (Sandbox Code Playgroud)

claudiu和安德鲁的解决方案都保持在0,所以我不得不将其增加到10000次.

我在Python 3中运行它(因为unicode),用39到1024的字符替换(38是&符号,所以我不想包含它).字符串长度高达10.000,包括大约980个替换,长度为0-20的可变随机插入.从39到1024的unicode值会导致1和2字节长度的字符,这可能会影响某些解决方案.

mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr)))

# Tor Valamo - too long
#t = time()
#for x in range(rep):
#    for k, v in mydict.items():
#        mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)

# Claudiu - too long
#def multiple_replace(dict, text): 
#    # Create a regular expression  from the dictionary keys
#    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
#    # For each match, look-up corresponding value in dictionary
#    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
#    multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)

# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

def mysubst2(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)
t = time()
for x in range(rep):
    mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)

# Andrew Y - fixed
def repl(s):
  return mydict[s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)

# gnibbler
t = time()
for x in range(rep):
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    "".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)
Run Code Online (Sandbox Code Playgroud)

结果:

Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict      0.0 # disqualified 329 secs
Peter fixed & variable dict    2.07799983025
Peter fixed dict               1.53100013733 
Claudio variable dict          0.0 # disqualified, 37 secs
Claudio fixed dict             1.5
Andrew Y variable dict         0.578000068665
Andrew Y variable dict 2       0.56299996376
Andrew Y fixed dict            0.56200003624
gnibbler fixed & variable dict 0.530999898911
Run Code Online (Sandbox Code Playgroud)

(**请注意,gnibbler的代码使用不同的字典,其中键没有包含'&'.安德鲁的代码也使用这个替代字典,但它没有太大的区别,可能只是0.01倍的加速.)

  • 你对Claudiu不太公平.首先,您将其作为函数调用,并且函数调用开销在Python中无法忽略.其次,他的编译步骤不会每次都完成,而是在程序启动时完成. (4认同)

Pet*_*sen 14

试试这个,利用正则表达式替换和标准字符串格式化:

# using your stated values for str and dict:
>>> import re
>>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str)
>>> str % dict
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'
Run Code Online (Sandbox Code Playgroud)

re.sub()调用替换所有的&符号序列,后跟单个字母,模式%(..)s包含相同的模式.

%格式化利用了字符串格式的功能,可以使字典指定替换,而不是更常出现的位置参数.

替代方法可以使用回调直接在re.sub中执行此操作:

>>> import re
>>> def dictsub(m):
>>>    return dict[m.group()]
>>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)
Run Code Online (Sandbox Code Playgroud)

这次我使用闭包从回调函数内部引用字典.这种方法可以为您提供更多灵活性.例如,dict.get(m.group(), '??')如果您的字符串包含无法识别的代码序列,则可以使用类似的方法来避免引发异常.

(顺便说一句,"dict"和"str"都是内置函数,如果你在自己的代码中使用这些名称,你会遇到麻烦.以防万一你不知道.他们没有问题.这样的问题当然.)

编辑: 我决定检查Tor的测试代码,并得出结论,它几乎没有代表性,实际上是马车.生成的字符串甚至没有&符号(!).下面修订的代码生成代表性字典和字符串,类似于OP的示例输入.

我还想验证每个算法的输出是否相同.下面是一个修改过的测试程序,只有Tor的,我的和Claudiu的代码 - 因为其他人打破了样本输入.(我认为它们都很脆弱,除非字典基本上映射了所有可能的&符号序列,Tor的测试代码正在进行.)这个正确地为随机数生成器播种,因此每次运行都是相同的.最后,我使用生成器添加了一个小变化,避免了一些函数调用开销,从而提高了性能.

from time import time
import string
import random
import re

random.seed(1919096)  # ensure consistent runs

# build dictionary with 40 mappings, representative of original question
mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40))
# build simulated input, with mix of text, spaces, ampersands in reasonable proportions
letters = string.letters + ' ' * 12 + '&' * 6
mystr = ''.join(random.choice(letters) for i in range(1000))

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and %d ampersands'
    % (rep, len(mystr), mystr.count('&')))

# Tor Valamo
# fixed from Tor's test, so it actually builds up the final string properly
t = time()
for x in range(rep):
    output = mystr
    for k, v in mydict.items():
        output = output.replace(k, v)
print('%-30s' % 'Tor fixed & variable dict', time() - t)
# capture "known good" output as expected, to verify others
expected = output

# Peter Hansen

# build charset to use in regex for safe dict lookup
charset = ''.join(x[1] for x in mydict.keys())
# grab reference to method on regex, for speed
patsub = re.compile(r'(&[%s])' % charset).sub

t = time()
for x in range(rep):
    output = patsub(r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
assert output == expected

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time() - t)
assert output == expected

# Peter 3 - freaky generator version, to avoid function call overhead
def dictsub(d):
    m = yield None
    while 1:
        m = yield d[m.group()]

dictsub = dictsub(mydict).send
dictsub(None)   # "prime" it
t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter generator', time() - t)
assert output == expected

# Claudiu - Precompiled
regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub

t = time()
for x in range(rep):
    output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time() - t)
assert output == expected
Run Code Online (Sandbox Code Playgroud)

我之前忘了包含基准测试结果:

    Running 10000 times with string length 1000 and 96 ampersands
    ('Tor fixed & variable dict     ', 2.9890000820159912)
    ('Peter fixed & variable dict   ', 2.6659998893737793)
    ('Peter fixed dict              ', 1.0920000076293945)
    ('Peter generator               ', 1.0460000038146973)
    ('Claudio fixed dict            ', 1.562000036239624)

此外,输入的片段和正确的输出:

mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...}
output = 'lTEQDMAPvksk k?[0;57m Txp vrnhQ GHaO?[0;67mNFY&&a P...'
Run Code Online (Sandbox Code Playgroud)

与我从Tor的测试代码输出中看到的相比:

mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
output = # same as mystr since there were no ampersands inside
Run Code Online (Sandbox Code Playgroud)

  • OP没有说明他需要防弹.他可能会说"保证字符串中的所有序列都在字典中".如果从StackOverflow中删除了没有完美错误处理的每个答案,那么只剩下少数几个...... (2认同)
  • 有时我非常希望代码"完全无法完成最小的错误".如果没有,我不会在我的程序的*other*部分找到问题,该部分首先插入了ampserand转义序列.当然,我对其他部分的单元测试告诉我它只生成所显示字典所涵盖的那些模式,所以我知道我不需要冗余的错误处理,你正在谈论添加到我干净的程序,给我带来负担额外的维护费用.(正如您所看到的,有些人会考虑不必要的错误处理.) (2认同)

Wol*_*ngP 8

如果你真的想深入研究这个话题,请看看:http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

迭代字典并替换字符串中的每个元素的显而易见的解决方案需要O(n*m)时间,其中n是字典的大小,m是字符串的长度.

而Aho-Corasick算法找到字典的所有条目,O(n+m+f)其中f是找到的元素的数量.


And*_*w Y 6

如果列表中的键数很大,并且字符串中出现的次数很少(并且大部分为零),那么您可以迭代字符串中&符号的出现,并使用第一个键入的字典子串的字符.我不经常在python中编码所以风格可能有点偏,但这是我的看法:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

dict = {"&y":"\033[0;30m",
        "&c":"\033[0;31m",
        "&b":"\033[0;32m",
        "&Y":"\033[0;33m",
        "&u":"\033[0;34m"}

def rep(s):
  return dict["&"+s[0:1]] + s[1:]

subs = str.split("&")
res = subs[0] + "".join(map(rep, subs[1:]))

print res
Run Code Online (Sandbox Code Playgroud)

当然有一个问题是当一个&符号来自字符串本身时会发生什么,你需要在通过这个过程之前以某种方式逃避它,然后在这个过程之后进行转换.

当然,就像性能问题一样,在典型(也是最糟糕的情况下)数据集上对各种方法进行计时并比较它们是一件好事.

编辑:将它放入一个单独的函数来处理任意字典:

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
Run Code Online (Sandbox Code Playgroud)

EDIT2:摆脱不必要的连接,在许多迭代中似乎仍然比前一个快一点.

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
Run Code Online (Sandbox Code Playgroud)