Why is it recommended to order the substrings in a regular expression by length?

Question

Longest first

>>> p = re.compile('supermanutd|supermanu|superman|superm|super')

Shortest first

>>> p = re.compile('super|superm|superman|supermanu|supermanutd')

Why is the longest-first regular expression preferred?

Answer 1

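Alternatives in a regular expression are tried in the order you give them, so once one branch matches, the regex engine does not check the remaining branches. If you only need to test whether there is a match, the order doesn't matter, but if you want to extract text based on the match, it matters a lot.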
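To make the difference between testing and extracting concrete, here is a minimal sketch with Python's `re` module. The two-branch patterns and the variable names are made up for illustration and are not from the question.

```python
import re

# Same two alternatives, in opposite orders (illustrative patterns).
short_first = re.compile('super|superman')
long_first = re.compile('superman|super')

text = 'superman'

# Testing for a match: both orderings agree that there is one.
found_short = short_first.search(text) is not None
found_long = long_first.search(text) is not None

# Extracting the match: the first branch that succeeds wins.
got_short = short_first.search(text).group()
got_long = long_first.search(text).group()

print(found_short, found_long)  # True True
print(got_short, got_long)      # super superman
```

Both patterns report a match, but only the longest-first one returns the whole word.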

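You only need to sort by length when a shorter alternative is a leading substring (a prefix) of a longer one. For example, when you have the text: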

supermanutd
supermanu
superman
superm

Then with your first regex you get:

>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']

But with the second regex:

>>> regex.findall(string)
[u'super', u'super', u'super', u'super']
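If you want to reproduce this yourself, the following is a self-contained sketch. It assumes `string` holds the four-line text above joined with newlines; the names `longest_first` and `shortest_first` are mine. Under Python 3 the results print without the `u` prefix.

```python
import re

# Assumption: the sample text is the four words above, one per line.
string = "supermanutd\nsupermanu\nsuperman\nsuperm"

longest_first = re.compile('supermanutd|supermanu|superman|superm|super')
shortest_first = re.compile('super|superm|superman|supermanu|supermanutd')

# findall returns one entry per non-overlapping match, left to right.
long_result = longest_first.findall(string)
short_result = shortest_first.findall(string)

print(long_result)
print(short_result)
```

Each line of the text yields exactly one match, so both lists have four entries. With the shortest-first pattern every entry is just `super`, because that branch succeeds before the longer ones are ever tried.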

Use http://www.pythonregex.com/ to test your regular expressions.
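If the list of words comes from data rather than being typed by hand, one way to get the longest-first ordering is to sort before joining. This is only a sketch: `longest_first_pattern` is a hypothetical helper name, and `re.escape` is added on the assumption that the words should be matched literally.

```python
import re

def longest_first_pattern(words):
    """Build an alternation that tries longer words before shorter ones.

    Hypothetical helper for illustration; not part of the re module.
    """
    # Longest first, so a word is always tried before any of its prefixes.
    ordered = sorted(words, key=len, reverse=True)
    # Escape each word so regex metacharacters are matched literally.
    return re.compile('|'.join(re.escape(w) for w in ordered))

words = ['super', 'superm', 'superman', 'supermanu', 'supermanutd']
p = longest_first_pattern(words)

print(p.pattern)
print(p.findall('supermanutd superm'))
```

For the words in the question this builds the same pattern as the longest-first version, so the input order of the list no longer matters.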