Regex on HTML with a lazy wildcard matches too much

bra*_*ido 0 html regex vbscript non-greedy regex-greedy

Regex:

<span style='.+?'>TheTextToFind</span>

HTML:

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span></span>

Why does the match include this?

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED

Example link

nha*_*tdh 5

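A regex engine always returns the leftmost match: it tries each starting position from left to right and reports the first one where the whole pattern can succeed. Laziness only affects how far a quantifier extends, not where the match starts. That is why you get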

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span>

as a single match (basically the whole input, minus the last </span>).
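The leftmost-match behaviour can be checked with a short script. This is a minimal sketch using Python's `re` module rather than VBScript's RegExp object (an assumption made for illustration; both are backtracking engines, and this pattern uses no engine-specific syntax). The variable names `html` and `lazy` are hypothetical.

```python
import re

# The input from the question, split across two string literals for readability.
html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# The original pattern with the lazy quantifier.
lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")

m = lazy.search(html)

# The match begins at the outer <span> (index 0), not at the inner one,
# because the engine succeeds at the first start position it tries.
print(m.start())
print(m.group(0))
```

The match covers everything up to and including the first `</span>`, so only the final closing tag is left out.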

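To steer the engine in the right direction, and assuming that > never appears directly inside an attribute value, the following regex matches what you want: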

<span style='[^>]+'>TheTextToFind</span>

This regex matches what you want because, under the assumption above, [^>]+ cannot match past the end of the tag.
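A quick way to confirm this is to run the negated-class pattern against the same input. The sketch below again uses Python's `re` module as a stand-in for the VBScript engine (an assumption), with hypothetical variable names.

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# [^>]+ cannot cross the '>' that ends the opening tag, so the attempt that
# starts at the outer <span> fails and the engine moves on to the inner one.
fixed = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")

m = fixed.search(html)
print(m.group(0))
```

Here the match starts at the inner `<span>` and contains only that element.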

However, I hope you are not using this as part of a program that extracts information from HTML pages. Use an HTML parser for that purpose.
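As a rough illustration of the parser-based approach, here is a sketch built on `html.parser.HTMLParser` from Python's standard library. The question targets VBScript, so the choice of Python is an assumption, and the class name `SpanTextFinder` and its `stack` and `found` attributes are hypothetical. It treats text as belonging to the innermost open `<span>`, which is a simplification.

```python
from html.parser import HTMLParser


class SpanTextFinder(HTMLParser):
    """Collects the style attribute of each <span> whose own text equals the target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # one (style, text_parts) entry per currently open <span>
        self.found = []   # style values of the spans that matched

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append((dict(attrs).get("style"), []))

    def handle_data(self, data):
        # Attribute the text to the innermost open <span>, if there is one.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            style, parts = self.stack.pop()
            if "".join(parts).strip() == self.target:
                self.found.append(style)


html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

finder = SpanTextFinder("TheTextToFind")
finder.feed(html)
finder.close()
print(finder.found)
```

Because the parser tracks element nesting itself, the outer span's text never gets mixed up with the inner span's attribute.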


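To understand why the original regex matches the way it does, you need to know that .+? backtracks by extending one character at a time until the sequel ('>TheTextToFind</span>) can match.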

# Matching .+?
# Since +? is lazy, it matches . once (to fulfill the minimum repetition), and
# increases the number of repetitions whenever the sequel fails to match
<span style='f                        # FAIL. Can't match closing '
<span style='fo                       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;        # PROCEED. But FAIL later, since can't match T in The
<span style='font-size:11.0pt;'       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;'>DON'  # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='
                                      # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
                                      # PROCEED. MATCH FOUND.

As you can see, .+? keeps increasing its length until it has matched font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;, which is the first point at which the sequel '>TheTextToFind</span> can match.
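The expansion traced above can be imitated by hand. The sketch below is not how a real engine is implemented. It is a simplified simulation that works here only because the text before and after `.+?` is literal and the input contains no newlines. The names `prefix`, `sequel`, and `consumed` are hypothetical.

```python
html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

prefix = "<span style='"            # literal text before .+?
sequel = "'>TheTextToFind</span>"   # literal text after .+?

# Position where .+? starts consuming, for the leftmost occurrence of the prefix.
start = html.index(prefix) + len(prefix)

consumed = None
# Lazy quantifier: try 1 character first, then 2, then 3, and so on.
for n in range(1, len(html) - start + 1):
    if html.startswith(sequel, start + n):
        consumed = html[start:start + n]
        break

print(consumed)
```

The loop stops at the first length for which the sequel fits, which is exactly the long span of text listed above.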