Why is my regular expression so lazy?

BFT*_*ick 1 regex

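Why is this regex so lazy? It should return a quoted height/width attribute, whatever sits in between (optional), and then another height/width attribute (optional). It only gets the first attribute and then quits, even though it could match more.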

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

Sample code on regexpal

Rob*_*t P 6

The easiest way to see what's going on is to break it out into extended format. In extended format, your regex...

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

...then becomes (with comments, which are legal in extended format):

(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # the equals sign
    ["']              # a double or single quote
    \d*               # zero or more digits 0-9
    ["']              # a double or single quote
)                     # required
(                     # a capturing group around one character: whitespace,
    [\s\w:;'"=]       # word char, colon, semicolon, quote, or equals
)*?                   # zero or more times, lazily (as few as it can)
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # the equals sign
    ["']              # a double or single quote
    \d*               # zero or more digits 0-9
    ["']              # a double or single quote
)?                    # optionally
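To see the laziness in action, here is a minimal sketch. It assumes Python's `re` module and a made-up sample string, since the question names neither the engine nor the input.

```python
import re

# The original pattern from the question, unchanged.
ORIGINAL = re.compile(
    r'''((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?'''
)

# Hypothetical input; the question does not show the real markup.
sample = 'height="100" width="200"'

match = ORIGINAL.search(sample)
print(match.group(0))  # the overall match
print(match.groups())  # the three capture groups
```

With this input the overall match stops after the first attribute and groups 2 and 3 stay unset: the lazy middle repeats zero times, and the optional last group is then allowed to match nothing.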

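So, depending on the regex engine you're using, your regex may produce anywhere from 1 to N groups. Your last group will be the one you want, if it's there. The reason it stops early is that the middle group is lazy and the final group is optional: once the first attribute has matched, the lazy group can repeat zero times and the optional group can match nothing, so the engine is already done. Remove the lazy modifier (the ?) from the second group and make the second group non-capturing, like this: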

(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # the equals sign
    ["']              # a double or single quote
    \d*               # zero or more digits 0-9
    ["']              # a double or single quote
)                     # required
(?:                   # a non-capturing group around one character: whitespace,
    [\s\w:;'"=]       # word char, colon, semicolon, quote, or equals
)*                    # zero or more times, as many as it can, UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # the equals sign
    ["']              # a double or single quote
    \d*               # zero or more digits 0-9
    ["']              # a double or single quote
)?                    # optional

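The intent is that the first and last attributes end up in groups 1 and 2, with whatever is in between ignored; the last one is captured if it is there. One caveat: the middle character class also matches every character of a height/width attribute, so a greedy middle can swallow the second attribute whole, and because the final group is optional the engine has no reason to backtrack and give it back. If group 2 comes back empty, keep the middle lazy and wrap it together with the final group in a single optional non-capturing group instead.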
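A quick way to check which behaviour you get is to run both shapes side by side. This is only a sketch under assumptions: Python's `re` module, a made-up sample string, and the helper names `ATTR`, `MIDDLE`, `greedy` and `wrapped`, none of which come from the question.

```python
import re

ATTR = r'''(?:height|width)=["']\d*["']'''  # one height/width attribute
MIDDLE = r'''[\s\w:;'"=]'''                 # the middle character class

# Greedy middle followed by an optional final group.
greedy = re.compile(rf'({ATTR})(?:{MIDDLE})*({ATTR})?')

# Alternative: lazy middle and final attribute share one optional group.
wrapped = re.compile(rf'({ATTR})(?:{MIDDLE}*?({ATTR}))?')

sample = 'height="100" width="200"'  # hypothetical input
print(greedy.search(sample).groups())
print(wrapped.search(sample).groups())
```

In Python the greedy form leaves group 2 unset for this input, while the wrapped form captures both attributes. The wrapped form still matches when only one attribute is present, because the whole trailing group is optional.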

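Note: it may not be capturing the last part because the middle group's character class does not cover every character that can appear in between. A comma, a #, or any other kind of punctuation is not in that class. You might try replacing the middle with the following: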

    ["']              # a double or single quote
)                     # required
.*                    # Anything, zero or more times, UNTIL...
(                     # a group that...
    (?:height|width)  # height or width (non-capturing)

and see whether that DOES match. If it does, you probably need to extend the middle group's character class.
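If you would rather find the offending characters directly, a small helper can list everything in the in-between text that the middle class rejects. This is a sketch only: `chars_outside_class` and the sample text are hypothetical, and it assumes Python's `re` module.

```python
import re

# The middle character class from the pattern above.
MIDDLE_CLASS = re.compile(r'''[\s\w:;'"=]''')

def chars_outside_class(text: str) -> list:
    """Return the distinct characters in text that the middle class rejects."""
    return sorted({ch for ch in text if not MIDDLE_CLASS.fullmatch(ch)})

# Hypothetical text sitting between the two attributes.
between = ' style="color:#fff,bold" '
print(chars_outside_class(between))
```

Whatever this prints is what you would need to add to the character class for the middle to bridge the two attributes.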

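If you don't care how many repetitions occur in the middle and just want to capture it, use a non-capturing group for each repetition and then wrap the whole repeated middle in a single capturing group: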

    ["']              # a double or single quote
)                     # required
(                     # a group that captures...
    (?:               # a non-capturing group around one character: whitespace,
        [\s\w:;'"=]   # word char, colon, semicolon, quote, or equals
    )*                # zero or more times, as many as it can
)                     # UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)

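Now you get a fixed number of captures: the first part is always in group 1, the middle part is always in group 2, and the last part (if it is there, and the greedy middle has not already consumed it) is in group 3.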
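The three-group layout can be tried out as follows. This is a sketch under assumptions: Python's `re` module, made-up sample strings, and a `variation` pattern of my own that keeps the middle lazy inside an optional wrapper; it is not the pattern shown above.

```python
import re

ATTR = r'''(?:height|width)=["']\d*["']'''  # one height/width attribute
MIDDLE = r'''[\s\w:;'"=]'''                 # the middle character class

# As written above: greedy middle in group 2, optional last part in group 3.
as_written = re.compile(rf'({ATTR})((?:{MIDDLE})*)({ATTR})?')

# Variation (an assumption, not the pattern above): the lazy middle and the
# last attribute sit together inside one optional non-capturing group.
variation = re.compile(rf'({ATTR})(?:((?:{MIDDLE})*?)({ATTR}))?')

for sample in ('width="5" alt="x" height="7"', 'width="5" alt="x"'):
    print(as_written.search(sample).groups())
    print(variation.search(sample).groups())
```

Both patterns always report three groups. The trade-off is that the greedy form always fills group 2 but may absorb the last attribute into it, while the variation fills groups 2 and 3 only when a second attribute is actually found.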