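I am trying to use a verbose regular expression in Python (2.7). In case it matters, I just want to make the expression easier to come back to and understand at some point in the future. Because I am new to this, I first wrote a compact expression to make sure I was getting what I wanted.
Here is the compact expression: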
test_verbose_item_pattern = re.compile('\n{1}\b?I[tT][eE][mM]\s+\d{1,2}\.?\(?[a-e]?\)?.*[^0-9]\n{1}')
It works as expected.
Here is the verbose expression:
verbose_item_pattern = re.compile("""
\n{1} #begin with a new line allow only one new line character
\b? #allow for a word boundary the ? allows 0 or 1 word boundaries \nITEM or \n ITEM
I # the first word on the line must begin with a capital I
[tT][eE][mM] #then we need one character from each of the three sets this allows for unknown case
\s+ # one or more white spaces this does allow for another \n not sure if I should change it
\d{1,2} # require one or two digits
\.? # there could be 0 or 1 periods after the digits 1. or 1
\(? # there might be 0 or 1 instance of an open paren
[a-e]? # there could be 0 or 1 instance of a letter in the range a-e
\)? # there could be 0 or 1 instance of a closing paren
.* #any number of unknown characters so we can have words and punctuation
[^0-9] # by its placement I am hoping that I am stating that I do not want to allow strings that end with a number and then \n
\n{1} #I want to cut it off at the next newline character
""",re.VERBOSE)
The problem is that when I run the verbose pattern I get an exception:
Traceback (most recent call last):
File "C:/Users/Dropbox/directEDGAR-Code-Examples/NewItemIdentifier.py", line 17, in <module>
""",re.VERBOSE)
File "C:\Python27\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 242, in _compile
raise error, v # invalid expression
error: nothing to repeat
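I am afraid this will turn out to be something silly, but I cannot figure it out. I did take my verbose expression and compact it line by line to make sure the compact version was the same as the verbose one.
The error message says there is nothing to repeat?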
unu*_*tbu 13
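It is a good habit to use raw string literals when defining regex patterns. Many regex patterns use backslashes, and a raw string literal lets you write a single backslash without having to worry about whether Python will interpret it as something else first (in which case you would have to write two backslashes).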
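To make the raw-string point concrete, here is a small sketch (not part of the original answer) of what Python does to the escapes before the re module ever sees them. The variable names are only illustrative. It also suggests the most likely source of the "nothing to repeat" error in the question: in a non-raw string, \n becomes a real newline, re.VERBOSE discards it as whitespace, and the {1} quantifier is left with nothing in front of it.

```python
import re

# Python processes escapes in an ordinary string literal before re sees it.
plain_b = "\b"    # one character: backspace (0x08)
raw_b = r"\b"     # two characters: backslash + b, the regex word boundary
plain_n = "\n"    # one character: an actual newline
raw_n = r"\n"     # two characters: backslash + n, the regex newline escape

print([len(s) for s in (plain_b, raw_b, plain_n, raw_n)])

# Under re.VERBOSE a literal newline is whitespace and is dropped,
# so this non-raw pattern effectively starts with the quantifier {1}.
try:
    re.compile("\n{1}I", re.VERBOSE)
    non_raw_error = None
except re.error as exc:
    non_raw_error = str(exc)
print(non_raw_error)

# The raw version keeps \n as a regex escape, which VERBOSE leaves alone.
raw_pattern = re.compile(r"\n{1}I", re.VERBOSE)
raw_match = raw_pattern.search("\nItem 1")
print(raw_match is not None)
```

This would also explain why the compact pattern compiled: without re.VERBOSE the real newline is kept as a literal character, and "\b" is silently a backspace rather than a word boundary.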
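\b? is not a valid regex. It says "0 or 1 word boundary", but either there is a word boundary at a given position or there is not. If there is one, you have 1 word boundary; if there is not, you have 0 word boundaries. So \b? (if it were valid regex) would always be true.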
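A quick way to check this yourself is sketched below (again, not from the original answer). On the CPython versions I am aware of, putting a quantifier directly after \b is rejected at compile time; the exact wording of the message may vary between versions. Since \b is zero-width, dropping it here does not change what gets matched.

```python
import re

# A quantifier applied directly to the zero-width \b assertion is rejected.
try:
    re.compile(r"\b?Item")
    boundary_error = None
except re.error as exc:
    boundary_error = str(exc)
print(boundary_error)

# \b consumes no characters, so both searches find the same span.
with_boundary = re.search(r"\bItem", "\nItem 1")
without_boundary = re.search(r"Item", "\nItem 1")
print(with_boundary.span() == without_boundary.span())
```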
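Regex distinguishes between the end of a string and the end of a line. (A string may consist of multiple lines.)
\A matches only at the start of the string.
\Z matches only at the end of the string.
$ matches at the end of the string (or just before a newline at the very end of it), and also at the end of each line in re.MULTILINE mode.
^ matches at the start of the string, and also at the start of each line in re.MULTILINE mode.
import re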
verbose_item_pattern = re.compile(r"""
$ # end of line boundary
\s{1,2} # 1-or-2 whitespace characters, including the newline
I # a capital I
[tT][eE][mM] # one character from each of the three sets this allows for unknown case
\s+ # 1-or-more whitespaces INCLUDING newline
\d{1,2} # 1-or-2 digits
[.]? # 0-or-1 literal .
\(? # 0-or-1 literal open paren
[a-e]? # 0-or-1 letter in the range a-e
\)? # 0-or-1 closing paren
.* # any number of unknown characters so we can have words and punctuation
[^0-9] # anything but [0-9]
$ # end of line boundary
""", re.VERBOSE|re.MULTILINE)
x = verbose_item_pattern.search("""
Item 1.0(a) foo bar
""")
print(x)
yields
<_sre.SRE_Match object at 0xb76dd020>
(which means the pattern matched)
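The difference between the string anchors and the line anchors described above can be seen on a small three-line string. This is only an illustrative sketch; the sample text and variable names are made up for the example.

```python
import re

text = "first\nsecond\nthird"

# Without re.MULTILINE, ^ and $ only anchor to the whole string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# With re.MULTILINE, ^ and $ also anchor at every line start and line end.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z keep anchoring to the whole string, even with re.MULTILINE.
start_string_only = re.findall(r"\A\w+", text, re.MULTILINE)
end_string_only = re.findall(r"\w+\Z", text, re.MULTILINE)

print(start_default, end_default)
print(start_multiline, end_multiline)
print(start_string_only, end_string_only)
```

This is why the rewritten pattern uses $ together with re.MULTILINE instead of a literal \n at each end: it can then match an item line wherever it sits in a larger block of text.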