How to split a long regular expression rule across multiple lines in Python

Mak*_*kis 34 python regex

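Is this actually doable? I have some very long regex pattern rules that are hard to understand because they don't fit on the screen at once. Example: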

test = re.compile('(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented' % (self.__MEMBER_TYPES), re.IGNORECASE)

Backslashes or triple quotes don't work.
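One likely reason triple quotes alone appear not to work: without re.VERBOSE, the newline and indentation inside the literal become part of the pattern and must match literally. A minimal sketch, using a shortened pattern and a made-up input line (both are assumptions, not the pattern from the question):

```python
import re

# Without re.VERBOSE the newline and the four spaces inside the
# triple-quoted literal are part of the pattern.
plain = re.compile(r'''\d+
    :\s+warning''')

# With re.VERBOSE, unescaped whitespace in the pattern is ignored.
verbose = re.compile(r'''\d+
    :\s+warning''', re.VERBOSE)

line = "42: warning"  # hypothetical input

print(plain.search(line))    # no match: the input has no newline after the digits
print(verbose.search(line))  # matches "42: warning"
```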

Edit: I ended up using VERBOSE mode. Here is what the regex pattern looks like now:

test = re.compile('''
  (?P<full_path>                                  # Capture a group called full_path
    .+                                            #   It consists of one or more characters of any type
  )                                               # Group ends                      
  :                                               # A literal colon
  \d+                                             # One or more numbers (line number)
  :                                               # A literal colon
  \s+warning:\s+parameters\sof\smember\s+         # An almost static string
  (?P<member_name>                                # Capture a group called member_name
    [                                             #   
      ^:                                          #   Match anything but a colon (so finding a colon ends group)
    ]+                                            #   Match one or more characters
   )                                              # Group ends
   (                                              # Start an unnamed group 
     ::                                           #   Two literal colons
     (?P<function_name>                           #   Start another group called function_name
       \w+                                        #     It consists of one or more alphanumeric characters
     )                                            #   End group
   )*                                             # This group is entirely optional and does not apply to C
   \s+are\snot\s\(all\)\sdocumented''',           # And line ends with an almost static string
   re.IGNORECASE|re.VERBOSE)                      # Let's not worry about case, because it seems to differ between Doxygen versions

nae*_*aeg 41

You can split your regex pattern by quoting each segment separately. No backslashes are needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string flag 'r', but you have to put it before each segment.

See the docs.
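To illustrate why the 'r' prefix has to be repeated, here is a small sketch. The shortened pattern, the group names `line` and `message`, and the sample input are made up for the example:

```python
import re

# The prefix applies to one literal only: the second literal below is
# not raw, so its '\b' is a single backspace character.
mixed = r'\b' '\b'
print(len(mixed))  # 3: backslash, 'b', backspace

# Putting r on every segment keeps all backslashes intact.
pattern = (r'(?P<line>\d+):'
           r'\s+warning:\s+'
           r'(?P<message>.+)')
regex = re.compile(pattern, re.IGNORECASE)

m = regex.search('17: Warning: Member foo is not documented')
print(m.group('line'), '|', m.group('message'))
```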


N3d*_*st4 22

From http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation:

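Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example: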

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

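Note that this feature is defined at the syntactical level, but implemented at compile time. The '+' operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).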
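A short sketch of that distinction: adjacent literals are joined without an operator, while anything that is not a literal (a variable, for instance) needs '+'. The names `member_types` and `head`, the pattern, and the input are hypothetical:

```python
import re

member_types = 'function|variable'  # hypothetical value, not from the question

# Adjacent literals are joined by the compiler; no operator is needed.
head = r'(?P<name>\w+)' r'\s+\('

# A variable is not a literal, so it has to be joined with '+'.
pattern = head + '(?P<kind>' + member_types + r')\)'

m = re.search(pattern, 'foo (variable)')
print(m.group('name'), m.group('kind'))
```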


eyq*_*uem 7

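Personally, I don't use re.VERBOSE because I don't like escaping blank spaces, and I don't want to write '\s' instead of a blank space when '\s' isn't required.
The more precisely the symbols in a regex pattern describe the character sequences that must be caught, the faster the regex object works. I almost never use '\s'.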
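For readers who haven't run into this, a minimal sketch of the escaping issue with re.VERBOSE (the sample text is made up):

```python
import re

text = "is not documented"

# In verbose mode the unescaped spaces are dropped, so this pattern is
# effectively 'isnotdocumented' and does not match the text.
dropped = re.compile(r"is not documented", re.VERBOSE)

# Escaping the space, or putting it in a character class, keeps it literal.
escaped = re.compile(r"is\ not[ ]documented", re.VERBOSE)

print(dropped.search(text))   # no match
print(escaped.search(text))   # matches the whole text
```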

To avoid re.VERBOSE, you can do the following, as has already been said:

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' # comment
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented'
% (self.__MEMBER_TYPES),

re.IGNORECASE)

Pushing the strings to the left leaves plenty of room for writing comments.


But this approach is not so good when the pattern is very long, because it isn't possible to write

test = re.compile(
'(?P<full_path>.+)'
':\d+:\s+warning:\s+Member\s+' # comment
'(?P<member_name>.+)'
'\s+\('
'(?P<member_type>%s)' % (self.__MEMBER_TYPES)  # !!!!!! INCORRECT SYNTAX !!!!!!!
'\) of '
'(class|group|namespace)'
#      ^^^^^^ underlining something to point out
'\s+'
'(?P<class_name>.+)'
#      vvv overlining something important too
'\s+is not documented',

re.IGNORECASE)

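Then, if the pattern is very long, the number of lines between the % (self.__MEMBER_TYPES) part at the end and the string '(?P<member_type>%s)' it applies to can be large, and the pattern becomes harder to read.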


That's why I like to use a tuple to write a very long pattern:

pat = ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % (self.__MEMBER_TYPES), # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))


This approach also allows the pattern to be defined as a function:

def pat(x):

    return ''.join((
'(?P<full_path>.+)',
# you can put a comment here, you see: a very very very long comment
':\d+:\s+warning:\s+Member\s+',
'(?P<member_name>.+)',
'\s+\(',
'(?P<member_type>%s)' % x , # comment here
'\) of ',
# comment here
'(class|group|namespace)',
#       ^^^^^^ underlining something to point out
'\s+',
'(?P<class_name>.+)',
#      vvv overlining something important too
'\s+is not documented'))

test = re.compile(pat(self.__MEMBER_TYPES), re.IGNORECASE)
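As a quick check of the function approach, here is a self-contained sketch that runs outside a class. The name `build_pattern`, the value 'function|variable', and the sample warning line are assumptions made for the example; raw strings are used so the backslashes reach the regex engine unchanged:

```python
import re

def build_pattern(member_types):
    # member_types is assumed to be a ready-made alternation such as 'a|b'
    return ''.join((
        r'(?P<full_path>.+)',
        r':\d+:\s+warning:\s+Member\s+',
        r'(?P<member_name>.+)',
        r'\s+\(',
        r'(?P<member_type>%s)' % member_types,  # formatting stays next to its string
        r'\) of ',
        r'(class|group|namespace)',
        r'\s+',
        r'(?P<class_name>.+)',
        r'\s+is not documented'))

test = re.compile(build_pattern('function|variable'), re.IGNORECASE)

# Hypothetical line in the shape the pattern expects
line = 'src/foo.h:12: warning: Member bar (function) of class Foo is not documented'
m = test.match(line)
print(m.group('full_path'), m.group('member_name'),
      m.group('member_type'), m.group('class_name'))
```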


Tho*_*est 6

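For completeness, the answer missing here is to use the re.X or re.VERBOSE flag, which the OP eventually pointed out. Besides saving quotes, this method is also portable to other regex implementations such as Perl.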

From https://docs.python.org/2/library/re.html#re.X :

re.X
re.VERBOSE

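This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.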

This means that the following two regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

 

b = re.compile(r"\d+\.\d*")
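A quick spot check of that equivalence on a few made-up inputs (the `samples` list is an assumption; this is an illustration, not a proof):

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

samples = ["3.14", "10.", "abc", "7"]  # hypothetical inputs

results = []
for s in samples:
    ma, mb = a.match(s), b.match(s)
    # Compare the matched text, or None when there is no match
    results.append((ma and ma.group(), mb and mb.group()))

print(results)
```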