Mik*_*cic 42 python regex string performance replace
假设我有一个看起来像这样的字符串:
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
Run Code Online (Sandbox Code Playgroud)
您会注意到字符串中有很多位置,其中有一个&符号,后跟一个字符(例如"&y"和"&c").我需要用字典中的适当值替换这些字符,如下所示:
dict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
Run Code Online (Sandbox Code Playgroud)
最快的方法是什么?我可以手动找到所有的&符号,然后循环通过字典来改变它们,但这似乎很慢.做一堆正则表达式替换似乎也很慢(我的实际代码中将有一个大约30-40对的字典).
任何建议表示赞赏,谢谢.
编辑:
正如在这个问题的评论中所指出的,我的字典是在运行时之前定义的,并且在应用程序生命周期的过程中永远不会改变.它是ANSI转义序列的列表,其中将包含大约40个项目.我要比较的平均字符串长度大约为500个字符,但最多可达5000个字符(但这些字符很少见).我目前也在使用Python 2.6.
编辑#2 我接受Tor Valamos的回答是正确的,因为它不仅提供了有效的解决方案(虽然它不是最好的解决方案),而是考虑了所有其他解决方案并做了大量的工作来比较所有这些.这个答案是我在StackOverflow上遇到过的最好,最有帮助的答案之一.感谢你.
Tor*_*amo 30
mydict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
for k, v in mydict.iteritems():
mystr = mystr.replace(k, v)
print mystr
The ?[0;30mquick ?[0;31mbrown ?[0;32mfox ?[0;33mjumps over the ?[0;34mlazy dog
Run Code Online (Sandbox Code Playgroud)
我冒昧地比较了几个解决方案:
mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])
# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars
from time import time
# How many times to run each solution
rep = 10000
print 'Running %d times with string length %d and ' \
'random inserts of lengths 0-20' % (rep, len(mystr))
# My solution
t = time()
for x in range(rep):
for k, v in mydict.items():
mystr.replace(k, v)
#print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t
from re import sub, compile, escape
# Peter Hansen
t = time()
for x in range(rep):
sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t
# Claudiu
def multiple_replace(dict, text):
# Create a regular expression from the dictionary keys
regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
# For each match, look-up corresponding value in dictionary
return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
t = time()
for x in range(rep):
multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t
# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))
t = time()
for x in range(rep):
regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t
# Andrew Y - variable dict
def mysubst(somestr, somedict):
subs = somestr.split("&")
return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
t = time()
for x in range(rep):
mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t
# Andrew Y - fixed
def repl(s):
return mydict["&"+s[0:1]] + s[1:]
t = time()
for x in range(rep):
subs = mystr.split("&")
res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t
Run Code Online (Sandbox Code Playgroud)
Python 2.6中的结果
Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict 1.04699993134
Peter fixed & variable dict 0.218999862671
Claudio variable dict 2.48400020599
Claudio fixed dict 0.0940001010895
Andrew Y variable dict 0.0309998989105
Andrew Y fixed dict 0.0310001373291
Run Code Online (Sandbox Code Playgroud)
claudiu和安德鲁的解决方案都保持在0,所以我不得不将其增加到10000次.
我在Python 3中运行它(因为unicode),用39到1024的字符替换(38是&符号,所以我不想包含它).字符串长度高达10.000,包括大约980个替换,长度为0-20的可变随机插入.从39到1024的unicode值会导致1和2字节长度的字符,这可能会影响某些解决方案.
mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])
# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars
from time import time
# How many times to run each solution
rep = 10000
print('Running %d times with string length %d and ' \
'random inserts of lengths 0-20' % (rep, len(mystr)))
# Tor Valamo - too long
#t = time()
#for x in range(rep):
# for k, v in mydict.items():
# mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)
from re import sub, compile, escape
# Peter Hansen
t = time()
for x in range(rep):
sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
# Peter 2
def dictsub(m):
return mydict[m.group()]
t = time()
for x in range(rep):
sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)
# Claudiu - too long
#def multiple_replace(dict, text):
# # Create a regular expression from the dictionary keys
# regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
# # For each match, look-up corresponding value in dictionary
# return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
# multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)
# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))
t = time()
for x in range(rep):
regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)
# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())
# Andrew Y - variable dict
def mysubst(somestr, somedict):
subs = somestr.split("&")
return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))
def mysubst2(somestr, somedict):
subs = somestr.split("&")
return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))
t = time()
for x in range(rep):
mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)
t = time()
for x in range(rep):
mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)
# Andrew Y - fixed
def repl(s):
return mydict[s[0:1]] + s[1:]
t = time()
for x in range(rep):
subs = mystr.split("&")
res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)
# gnibbler
t = time()
for x in range(rep):
myparts = mystr.split("&")
myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
"".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)
Run Code Online (Sandbox Code Playgroud)
结果:
Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict 0.0 # disqualified 329 secs
Peter fixed & variable dict 2.07799983025
Peter fixed dict 1.53100013733
Claudio variable dict 0.0 # disqualified, 37 secs
Claudio fixed dict 1.5
Andrew Y variable dict 0.578000068665
Andrew Y variable dict 2 0.56299996376
Andrew Y fixed dict 0.56200003624
gnibbler fixed & variable dict 0.530999898911
Run Code Online (Sandbox Code Playgroud)
(**请注意,gnibbler的代码使用不同的字典,其中键没有包含'&'.安德鲁的代码也使用这个替代字典,但它没有太大的区别,可能只是0.01倍的加速.)
Pet*_*sen 14
试试这个,利用正则表达式替换和标准字符串格式化:
# using your stated values for str and dict:
>>> import re
>>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str)
>>> str % dict
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'
Run Code Online (Sandbox Code Playgroud)
re.sub()调用替换所有的&符号序列,后跟单个字母,模式%(..)s包含相同的模式.
%格式化利用了字符串格式的功能,可以使字典指定替换,而不是更常出现的位置参数.
替代方法可以使用回调直接在re.sub中执行此操作:
>>> import re
>>> def dictsub(m):
>>> return dict[m.group()]
>>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)
Run Code Online (Sandbox Code Playgroud)
这次我使用闭包从回调函数内部引用字典.这种方法可以为您提供更多灵活性.例如,dict.get(m.group(), '??')如果您的字符串包含无法识别的代码序列,则可以使用类似的方法来避免引发异常.
(顺便说一句,"dict"和"str"都是内置函数,如果你在自己的代码中使用这些名称,你会遇到麻烦.以防万一你不知道.他们没有问题.这样的问题当然.)
编辑: 我决定检查Tor的测试代码,并得出结论,它几乎没有代表性,实际上是马车.生成的字符串甚至没有&符号(!).下面修订的代码生成代表性字典和字符串,类似于OP的示例输入.
我还想验证每个算法的输出是否相同.下面是一个修改过的测试程序,只有Tor的,我的和Claudiu的代码 - 因为其他人打破了样本输入.(我认为它们都很脆弱,除非字典基本上映射了所有可能的&符号序列,Tor的测试代码正在进行.)这个正确地为随机数生成器播种,因此每次运行都是相同的.最后,我使用生成器添加了一个小变化,避免了一些函数调用开销,从而提高了性能.
from time import time
import string
import random
import re
random.seed(1919096) # ensure consistent runs
# build dictionary with 40 mappings, representative of original question
mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40))
# build simulated input, with mix of text, spaces, ampersands in reasonable proportions
letters = string.letters + ' ' * 12 + '&' * 6
mystr = ''.join(random.choice(letters) for i in range(1000))
# How many times to run each solution
rep = 10000
print('Running %d times with string length %d and %d ampersands'
% (rep, len(mystr), mystr.count('&')))
# Tor Valamo
# fixed from Tor's test, so it actually builds up the final string properly
t = time()
for x in range(rep):
output = mystr
for k, v in mydict.items():
output = output.replace(k, v)
print('%-30s' % 'Tor fixed & variable dict', time() - t)
# capture "known good" output as expected, to verify others
expected = output
# Peter Hansen
# build charset to use in regex for safe dict lookup
charset = ''.join(x[1] for x in mydict.keys())
# grab reference to method on regex, for speed
patsub = re.compile(r'(&[%s])' % charset).sub
t = time()
for x in range(rep):
output = patsub(r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
assert output == expected
# Peter 2
def dictsub(m):
return mydict[m.group()]
t = time()
for x in range(rep):
output = patsub(dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time() - t)
assert output == expected
# Peter 3 - freaky generator version, to avoid function call overhead
def dictsub(d):
m = yield None
while 1:
m = yield d[m.group()]
dictsub = dictsub(mydict).send
dictsub(None) # "prime" it
t = time()
for x in range(rep):
output = patsub(dictsub, mystr)
print('%-30s' % 'Peter generator', time() - t)
assert output == expected
# Claudiu - Precompiled
regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub
t = time()
for x in range(rep):
output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time() - t)
assert output == expected
Run Code Online (Sandbox Code Playgroud)
我之前忘了包含基准测试结果:
Running 10000 times with string length 1000 and 96 ampersands
('Tor fixed & variable dict ', 2.9890000820159912)
('Peter fixed & variable dict ', 2.6659998893737793)
('Peter fixed dict ', 1.0920000076293945)
('Peter generator ', 1.0460000038146973)
('Claudio fixed dict ', 1.562000036239624)
此外,输入的片段和正确的输出:
mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...}
output = 'lTEQDMAPvksk k?[0;57m Txp vrnhQ GHaO?[0;67mNFY&&a P...'
Run Code Online (Sandbox Code Playgroud)
与我从Tor的测试代码输出中看到的相比:
mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
output = # same as mystr since there were no ampersands inside
Run Code Online (Sandbox Code Playgroud)
如果你真的想深入研究这个话题,请看看:http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
迭代字典并替换字符串中的每个元素的显而易见的解决方案需要O(n*m)时间,其中n是字典的大小,m是字符串的长度.
而Aho-Corasick算法找到字典的所有条目,O(n+m+f)其中f是找到的元素的数量.
如果列表中的键数很大,并且字符串中出现的次数很少(并且大部分为零),那么您可以迭代字符串中&符号的出现,并使用第一个键入的字典子串的字符.我不经常在python中编码所以风格可能有点偏,但这是我的看法:
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
dict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
def rep(s):
return dict["&"+s[0:1]] + s[1:]
subs = str.split("&")
res = subs[0] + "".join(map(rep, subs[1:]))
print res
Run Code Online (Sandbox Code Playgroud)
当然有一个问题是当一个&符号来自字符串本身时会发生什么,你需要在通过这个过程之前以某种方式逃避它,然后在这个过程之后进行转换.
当然,就像性能问题一样,在典型(也是最糟糕的情况下)数据集上对各种方法进行计时并比较它们是一件好事.
编辑:将它放入一个单独的函数来处理任意字典:
def mysubst(somestr, somedict):
subs = somestr.split("&")
return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
Run Code Online (Sandbox Code Playgroud)
EDIT2:摆脱不必要的连接,在许多迭代中似乎仍然比前一个快一点.
def mysubst(somestr, somedict):
subs = somestr.split("&")
return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
18800 次 |
| 最近记录: |