Abh*_*bhi 9 python regex python-2.7
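I am currently using Python's re module to search for and capture groups. I have a list of regular expressions that I have to compile and match against a large dataset, and this is causing performance problems.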
Example:
REGEXES = [
    '^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    '^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    '(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    '^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    '^(?P<title>.+?)[- ]+E(?P<epi>\d+)$'
    .
    .
    .
    .
]
Note: the regexes are not similar to one another.
COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]

def find_match(string):
    for regex in COMPILED_REGEXES:
        match = regex.search(string)
        if not match:
            continue
        return match
Is there a way around this? The idea is to avoid iterating through the compiled regexes to get a match.
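Do any of your regexes break DFA compatibility? It doesn't look like it from your examples. You can use a Python wrapper around a C/C++ DFA implementation such as re2, which is a drop-in replacement for re. re2 will also fall back to re if a regex is incompatible with the re2 syntax, so it optimizes every case it can and does not fail on the incompatible ones.
Note that re2 does support the (?P<name>regex) capture syntax, but it does not support the (?P=name) backref syntax.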
try:
    import re2 as re
    re.set_fallback_notification(re.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; fall back to the standard library
    import re
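If you have regexes with backrefs, you can still use re2 with a few special considerations: you need to replace the backrefs in your regexes with .*?, and you may then get false matches, which you can filter out with re. In real-world data, false matches will probably be uncommon.
Here is an illustrative example: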
import re
try:
    import re2
    re2.set_fallback_notification(re2.FALLBACK_WARNING)
except ImportError:
    # latest version was for Python 2.6; fall back to re so the code below still runs
    re2 = re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]
# replace all backrefs (\1 to \99 and (?P=name)) with .*? for re2 compatibility
# (the pattern must not match \d itself, or every digit class would be rewritten)
# is there other unsupported syntax in REGEXES?
COMPILED_REGEXES_DFA = [re2.compile(re2.sub(r'\\[1-9]\d?|\(\?P=\w+\)', '.*?', r), flags=re2.I) for r in REGEXES]

def find_match(string):
    for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
        match_dfa = regex_dfa.search(string)
        if not match_dfa:
            continue
        # confirm the DFA hit with the exact regex to filter out false matches
        match = regex.search(string)
        # most likely branch comes first for better branch prediction
        if match:
            return match
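
The sample regexes above contain no backreferences, so the false-match case never actually arises with them. To make that caveat concrete, here is a minimal sketch that uses only the standard re module for both stages, so it runs without re2 installed. The pattern, the relax_backrefs helper, and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical pattern with a numbered backreference: the same word twice.
STRICT = r'^(\w+) \1$'

def relax_backrefs(pattern):
    """Replace numbered and named backreferences in a pattern with .*?"""
    return re.sub(r'\\[1-9]\d?|\(\?P=\w+\)', '.*?', pattern)

# The relaxed pattern matches a superset of what the strict one matches.
RELAXED = relax_backrefs(STRICT)

strict_re = re.compile(STRICT)
relaxed_re = re.compile(RELAXED)   # stand-in for the re2-compiled version

def find_match(string):
    # Cheap superset check first, then confirm with the exact pattern.
    if relaxed_re.search(string) is None:
        return None
    return strict_re.search(string)
```

With these definitions, 'hello world' passes the relaxed pattern but is rejected by the strict one, which is exactly the kind of false match the second stage is there to filter out.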
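If this still isn't fast enough, you can use various techniques to feed the DFA hits to re as they are found, rather than storing them in a file or in memory and handing them off once they have all been collected.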
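
One way to read "feeding hits to re as they are found" is a lazy generator pipeline, where each line that passes the cheap first stage is confirmed immediately and nothing is buffered. The sketch below assumes that reading. The names prefilter, confirm, and iter_matches are hypothetical, and the standard re module stands in for the DFA engine so the example is self-contained.

```python
import re

# Stand-in for a DFA-compiled superset pattern (re2.compile(...) in practice).
prefilter = re.compile(r'\d+[-/]\d+', flags=re.I)

# Exact patterns, checked only for lines that pass the prefilter.
confirm = [
    re.compile(r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$', flags=re.I),
    re.compile(r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$', flags=re.I),
]

def iter_matches(lines):
    """Yield (line, match) pairs lazily, one input line at a time."""
    for line in lines:
        if prefilter.search(line) is None:
            continue                  # rejected by the cheap stage
        for regex in confirm:
            match = regex.search(line)
            if match:
                yield line, match
                break                 # first matching pattern wins
```

Because iter_matches accepts any iterable of strings, an open file object can be passed in directly and is then consumed line by line.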
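You can also combine all of your regexes into one big DFA regex made of alternated groups, (r1)|(r2)|(r3)| ... |(rN), and then iterate over the group matches on the resulting match object so that you only try the corresponding original regexes. The resulting match object will have the same state as in the OP's original solution.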
# rename group names in the regexes to avoid name collisions
REGEXES_PREFIXED = [re2.sub(r'\(\?P<(\w+)>', r'(?P<re{}_\1>'.format(idx), r) for idx, r in enumerate(REGEXES)]
# wrap and fold the regexes: (?P<hit0>pattern)| ... |(?P<hitN>pattern)
REGEX_BIG = '|'.join('(?P<hit{}>{})'.format(idx, r) for idx, r in enumerate(REGEXES_PREFIXED))
regex_dfa_big = re2.compile(REGEX_BIG, flags=re2.I)

def find_match(string):
    match_dfa = regex_dfa_big.search(string)
    if match_dfa:
        # only interested in the hit# groups that actually took part in the match
        hits = [n for n, v in match_dfa.groupdict().items() if v is not None and re2.match(r'hit\d+$', n)]
        # check for false positives, in the original regex order
        for idx in sorted(int(h.replace('hit', '')) for h in hits):
            match = COMPILED_REGEXES[idx].search(string)
            if match:
                return match
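You can also look at pyre, which is a better-maintained wrapper around the same C++ library, although it is not a drop-in replacement for re. There is also a Python wrapper for RuRe, which is the fastest regex engine I know of.
To elaborate on my comment: the problem with putting everything into one big regex is that group names must be unique. However, you can process your regexes as follows: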
import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$']

# Find the names of groups in the regexps
groupnames = {'RE_%s'%i:re.findall(r'\(\?P<([^>]+)>', r) for i, r in enumerate(REGEXES)}

# Convert the named groups into unnamed ones
re_list_cleaned = [re.sub(r'\?P<([^>]+)>', '', r) for r in REGEXES]

# Wrap each regexp in a named group
token_re_list = ['(?P<RE_%s>%s)'%(i, r) for i, r in enumerate(re_list_cleaned)]

# Put them all together
mighty_re = re.compile('|'.join(token_re_list), re.MULTILINE)

# Use the regexp to process a big file
with open('bigfile.txt') as f:
    txt = f.read()

for match in mighty_re.finditer(txt):
    # Now find out which regexp made the match and put the matched data in a dictionary
    re_name = match.lastgroup
    groups = [g for g in match.groups() if g is not None]
    gn = groupnames[re_name]
    # groups[0] is the wrapper group RE_<i>; the rest line up with the original names
    matchdict = dict(zip(gn, groups[1:]))
    print ('Found:', re_name, matchdict)
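
The positional zip above relies on every group in the matching regex having taken part in the match, and on the original regexes containing no unnamed groups. As a hedged variation, the sketch below keeps the groups named, prefixes each name with the index of its regex so the names stay unique, and then reads the values back from groupdict(). The prefix scheme (r0_, r1_, ...), the re.IGNORECASE flag, and the helper names prefixed and match_to_dict are assumptions made for illustration.

```python
import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

def prefixed(idx, pattern):
    # (?P<name> becomes (?P<r{idx}_name> so names are unique across regexes
    return re.sub(r'\(\?P<(\w+)>', r'(?P<r%d_\1>' % idx, pattern)

combined = re.compile(
    '|'.join('(?P<RE_%d>%s)' % (i, prefixed(i, r)) for i, r in enumerate(REGEXES)),
    re.MULTILINE | re.IGNORECASE,
)

def match_to_dict(match):
    """Return (regex index, {original group name: value}) for one match."""
    idx = int(match.lastgroup[len('RE_'):])
    prefix = 'r%d_' % idx
    groups = {name[len(prefix):]: value
              for name, value in match.groupdict().items()
              if name.startswith(prefix)}
    return idx, groups
```

Calling match_to_dict on each result of combined.finditer(txt) gives the index of the regex that matched together with its groups under their original names. Groups that did not take part in the match appear with the value None instead of shifting the remaining values.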