这个正则表达式不应该发生灾难性的回溯

Question

这个正则表达式不应该发生灾难性的回溯

有人可以解释为什么Java的正则表达式引擎会在这个正则表达式中进入灾难性的回溯模式吗？根据我的判断,每次交替都是互相排斥的.

^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
\"(?:[^\"]+|\"\")+\"|
'(?:[^']+|'')+')

Run Code Online (Sandbox Code Playgroud)

文本: 'pão de açúcar itaucard mastercard platinum SUSTENTABILIDADE])

为一些替换添加所有格匹配修复了问题,但我不知道为什么 - Java的正则表达式lib必须非常错误才能在互斥分支上回溯.

 ^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
 \"(?:[^\"]++|\"\")++\"|
 '(?:[^']++|'')++')

Run Code Online (Sandbox Code Playgroud)

Answer 1

tch*_*ist 17

编辑:最后添加Java版本 - 尽管本质上是笨拙,不可读和不可维护.

¡没有更丑陋的模式!

你需要做的第一件事就是以一种能够承受人类可读性和可维护性的任何可能希望的方式编写你的正则表达式.您需要做的第二件事是对此进行分析,以了解它实际上在做什么.

这意味着您至少需要在Pattern.COMMENTS模式(或前缀"(?x)")中编译它,然后添加空格以提供一些视觉肘部空间.尽可能接近,你实际上想要匹配的模式是这样的:

^ 
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted
    [^"\s~:/@\#|^&\[\]()\\{}] *
  | " (?: [^"]+ | "" )+ "        # alt 2: double quoted
  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted
)

Run Code Online (Sandbox Code Playgroud)

如你所见,我已经在可能的地方引入了垂直和水平空白,以引导眼睛和头脑作为一种认知分块.我还删除了所有无关的反斜杠.这些都是彻头彻尾的错误,或者是混淆器,除了使读者感到困惑之外什么都不做.

请注意,在应用垂直空白时,我已经在同一列中生成了从一行到下一行的相同部分,以便您可以立即看到哪些部分相同以及哪些部分不同.

完成后,我终于可以看到你在这里的一个匹配锚定到开始,然后选择三个选项.因此,我已经用描述性评论标记了这三个备选方案,因此无需猜测.

我还注意到你的第一个选择有两个微妙的不同(否定)方括号括号字符类.第二个缺少第一个中的单引号排除.这是故意的吗？即使是这样,我发现这对我的口味来说太过重复了; 部分或全部应该在变量中,这样您就不会冒更新一致性问题的风险.

剖析

你需要做的两件事中的第二件也是更重要的是对此进行分析.您需要准确查看正在编译该模式的正则表达式程序,并且需要在运行数据时跟踪其执行情况.

Java的Pattern类目前无法做到这一点,虽然我已经与OraSun的当前代码保管人谈了很长时间,并且他都热衷于将此功能添加到Java并认为他确切知道如何做到这一点.他甚至给我发了一个原型,它完成了第一部分:汇编.所以我希望有一天可以使用它.

与此同时,让我们转而使用一种工具,其中正则表达式是编程语言本身的一个组成部分,而不是作为一个尴尬的事后想法.虽然有几种语言符合这一标准,但在模式匹配方面,没有一种语言符合Perl所见的复杂程度.

这是一个等效的程序.

#!/usr/bin/env perl
use v5.10;      # first release with possessive matches
use utf8;       # we have utf8 literals
use strict;     # require variable declarations, etc
use warnings;   # check for boneheadedness

my $match = qr{
    ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}]
          [^"\s~:/@\#|^&\[\]()\\{}] *
        | " (?: [^"]+ | "" )+ "
        | ' (?: [^']+ | '' )+ '
    )
}x;

my $text = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";

my $count = 0;

while ($text =~ /$match/g) {
    print "Got it: $&\n";
    $count++;
}

if ($count == 0) {
    print "Match failed.\n";
}

Run Code Online (Sandbox Code Playgroud)

如果我们运行该程序,我们会得到匹配失败的预期答案.问题是为什么以及如何.

我们现在想看两件事:我们想看看模式编译的正则表达式程序,然后我们想跟踪该正则表达式程序的执行.

这两个都是用

use re "debug";

Run Code Online (Sandbox Code Playgroud)

pragma,也可以在命令行中指定-Mre=debug.这就是我们在这里要做的,以避免黑客攻击源代码.

正则表达式编译

在re调试编译通常会显示该模式的两个编译和执行.为了分离这些,我们可以使用Perl的"仅编译"开关-c,它不会尝试执行它编译的程序.这样我们所看到的就是编译模式.这些产生以下36行输出:

$ perl -c -Mre=debug /tmp/bt
Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (79)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (79)
        <"> (29)
  29:     CURLYX[0] {1,32767} (49)
  31:       BRANCH (44)
  32:         PLUS (48)
  33:           ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  44:       BRANCH (FAIL)
  45:         EXACT <""> (48)
  47:       TAIL (48)
  48:     WHILEM[1/2] (0)
  49:     NOTHING (50)
  50:     EXACT <"> (79)
        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)
  73:       TAIL (74)
  74:     WHILEM[2/2] (0)
  75:     NOTHING (76)
  76:     EXACT <'> (79)
  78: TAIL (79)
  79: END (0)
anchored(BOL) minlen 1 
/tmp/bt syntax OK
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

Run Code Online (Sandbox Code Playgroud)

如您所见,编译的正则表达式程序本身就是一种"正则表达式汇编语言".(它看起来非常类似于向我演示的Java原型吐出的内容,所以我想有一天你也会在Java中看到这种东西.)所有细节都不是必需的,但我要指出,节点2的指令是BRANCH,如果失败则继续执行分支26,另一个BRANCH.第二个BRANCH是正则表达式程序的唯一其他部分,由一个TRIE-EXACT节点组成,因为它知道备选方案具有不同的起始字符串.在我们将要讨论的那两个特里分支中会发生什么.

正则表达式执行

现在是时候看看它运行时会发生什么.您正在使用的文本字符串会导致相当多的回溯,这意味着在最终失败之前您将需要大量输出.多少输出？嗯,这么多:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
9987

Run Code Online (Sandbox Code Playgroud)

我认为10,000步是你所说的"灾难性回溯模式".让我们看看,我们不能把它简化为更易于理解的东西.您的输入字符串长度为61个字符.为了更好地了解正在发生的事情,我们可以将其修剪为just 'pão,只有4个字符.(好吧,在NFC中,也就是说;它是NFD中的五个代码点,但这里没有任何改变).这导致167行输出:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
167

Run Code Online (Sandbox Code Playgroud)

实际上,下面是正则表达式(编译加)执行分析的行,当你的字符串长这么多字符时你会得到:

chars   lines   string
    1     63   ‹'›
    2     78   ‹'p›  
    3    109   ‹'pã›
    4    167   ‹'pão› 
    5    290   ‹'pão ›
    6    389   ‹'pão d›
    7    487   ‹'pão de›
    8    546   ‹'pão de ›
    9    615   ‹'pão de a›
   10    722   ‹'pão de aç›
  ....
   61   9987   ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])›

Run Code Online (Sandbox Code Playgroud)

Let’s look at the debugging output when the string is the four characters 'pão. I’ve omitted the compilation part this time to show only the execution part:

$ perl -Mre=debug /tmp/bt
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o"
UTF-8 string...
   0 <> <'p%x{e3}o>  |  1:BOL(2)
   0 <> <'p%x{e3}o>  |  2:BRANCH(26)
   0 <> <'p%x{e3}o>  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o>  | 26:BRANCH(78)
   0 <> <'p%x{e3}o>  | 27:  TRIE-EXACT["'](79)
   0 <> <'p%x{e3}o>  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o>  |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o>  | 55:  CURLYX[0] {1,32767}(75)
   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)   1 <'> <p%x{e3}o>          | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   4 <'p%x{e3}> <o>  | 70:            BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:            NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   2 <'p> <%x{e3}o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   2 <'p> <%x{e3}o>  | 57:            BRANCH(70)
   2 <'p> <%x{e3}o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 2 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
   4 <'p%x{e3}> <o>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:                  BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                      WHILEM[2/2](0)
                                                whilem: matched 3 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                        BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                          PLUS(74)
                                                    ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647.
..
                                                    failed...
   5 <'p%x{e3}o> <>  | 70:                        BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                          EXACT <''>(74)
                                                    failed...
                                                  BRANCH failed...
                                                whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                        NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                        EXACT <'>(79)
                                                  failed...
                                                failed...
                                              failed...
   4 <'p%x{e3}> <o>  | 70:                  BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:                  NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   2 <'p> <%x{e3}o>  | 70:            BRANCH(73)
   2 <'p> <%x{e3}o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   2 <'p> <%x{e3}o>  | 75:            NOTHING(76)
   2 <'p> <%x{e3}o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
                                  failed...
   1 <'> <p%x{e3}o>  | 70:      BRANCH(73)
   1 <'> <p%x{e3}o>  | 71:        EXACT <''>(74)
                                  failed...
                                BRANCH failed...
                              failed...
                            failed...
                          BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

Run Code Online (Sandbox Code Playgroud)

What you see happening there is that the trie quickly branches down to node 55, which is the last of your three alternatives after it has matched the single quote, because your target string starts with a single quote. That is this one:

  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted

Run Code Online (Sandbox Code Playgroud)

Node 55 is this trie branch:

        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)

Run Code Online (Sandbox Code Playgroud)

And here is the execution trace showing where your catastrophic backoff is happening:

   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)
   1 <'> <p%x{e3}o>  | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767

Run Code Online (Sandbox Code Playgroud)

Node 58 gobbled up all 3 remaining characters in the string, pão. That caused the terminating exact match of a single quote to fail. It therefore attempts your alternative, which is a pair of single quotes, and that also fails.

It is at this point I have to question your pattern. Shouldn’t

' (?: [^']+ | '' )+ '

Run Code Online (Sandbox Code Playgroud)

really simply be this?

' [^']* '

Run Code Online (Sandbox Code Playgroud)

So what is going on is that there are lot of ways for it to backtrack looking for something that can logically never occur. You have a nested quantifier, and that is causing all kinds of hopeless and mindless busywork.

If we swap simplify the pattern into this:

^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +
    | " [^"]* "
    | ' [^']* '
  )

Run Code Online (Sandbox Code Playgroud)

It now gives the same number of trace output lines no matter the size of the input string: only 40, and that includes the compilation. Witness both compilation and execution on the full string:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (61)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (61)
        <"> (29)
  29:     STAR (41)
  30:       ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  41:     EXACT <"> (61)
        <'> (46)
  46:     STAR (58)
  47:       ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  58:     EXACT <'> (61)
  60: TAIL (61)
  61: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mast
ercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o >  |  1:BOL(2)
   0 <> <'p%x{e3}o >  |  2:BRANCH(26)
   0 <> <'p%x{e3}o >  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                             failed...
   0 <> <'p%x{e3}o >  | 26:BRANCH(60)
   0 <> <'p%x{e3}o >  | 27:  TRIE-EXACT["'](61)
   0 <> <'p%x{e3}o >  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d> |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                             got 1 possible matches
                             TRIE matched word #2, continuing
                             only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d> | 46:  STAR(58)
                             ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
                             failed...
                           BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

Run Code Online (Sandbox Code Playgroud)

I know you were thinking that possessive matching might be the answer here, but I think the real problem is the faulty logic in the original pattern. See how more sanely this runs now?

If we run it with your possessives on the old pattern, even though I don’t think that makes sense, we still get constant runtime, but it takes more steps. With this pattern

   ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +    # alt 1: unquoted
       | " (?: [^"]++ | "" )++ "        # alt 2: double quoted
       | ' (?: [^']++ | '' )++ '        # alt 3: single quoted
     )

Run Code Online (Sandbox Code Playgroud)

The compilation plus execution profile is as follows:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (95)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (95)
        <"> (29)
  29:     SUSPEND (58)
  31:       CURLYX[0] {1,32767} (55)
  33:         BRANCH (50)
  34:           SUSPEND (54)
  36:             PLUS (48)
  37:               ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  48:             SUCCEED (0)
  49:           TAIL (53)
  50:         BRANCH (FAIL)
  51:           EXACT <""> (54)
  53:         TAIL (54)
  54:       WHILEM[1/2] (0)
  55:       NOTHING (56)
  56:       SUCCEED (0)
  57:     TAIL (58)
  58:     EXACT <"> (95)
        <'> (63)
  63:     SUSPEND (92)
  65:       CURLYX[0] {1,32767} (89)
  67:         BRANCH (84)
  68:           SUSPEND (88)
  70:             PLUS (82)
  71:               ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  82:             SUCCEED (0)
  83:           TAIL (87)
  84:         BRANCH (FAIL)
  85:           EXACT <''> (88)
  87:         TAIL (88)
  88:       WHILEM[2/2] (0)
  89:       NOTHING (90)
  90:       SUCCEED (0)
  91:     TAIL (92)
  92:     EXACT <'> (95)
  94: TAIL (95)
  95: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mastercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o > |  1:BOL(2)
   0 <> <'p%x{e3}o > |  2:BRANCH(26)
   0 <> <'p%x{e3}o > |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o > | 26:BRANCH(94)
   0 <> <'p%x{e3}o > | 27:  TRIE-EXACT["'](95)
   0 <> <'p%x{e3}o > |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d>|      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d>| 63:  SUSPEND(92)
   1 <'> <p%x{e3}o d>| 65:    CURLYX[0] {1,32767}(89)
   1 <'> <p%x{e3}o d>| 88:      WHILEM[2/2](0)
                                whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o d>| 67:        BRANCH(84)
   1 <'> <p%x{e3}o d>| 68:          SUSPEND(88)
   1 <'> <p%x{e3}o d>| 70:            PLUS(82)
                                      ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
  64 <NTABILIDAD])> <| 82:              SUCCEED(0)
                                        subpattern success...
  64 <NTABILIDAD])> <| 88:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
  64 <NTABILIDAD])> <| 67:            BRANCH(84)
  64 <NTABILIDAD])> <| 68:              SUSPEND(88)
  64 <NTABILIDAD])> <| 70:                PLUS(82)
                                          ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                          failed...
                                        failed...
  64 <NTABILIDAD])> <| 84:            BRANCH(87)
  64 <NTABILIDAD])> <| 85:              EXACT <''>(88)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
  64 <NTABILIDAD])> <| 89:            NOTHING(90)
  64 <NTABILIDAD])> <| 90:            SUCCEED(0)
                                      subpattern success...
  64 <NTABILIDAD])> <| 92:  EXACT <'>(95)
                failed...
              BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

Run Code Online (Sandbox Code Playgroud)

I still like my solution better. It’s shorter.

EDIT

It appears that the Java version really is taking 100 times more steps than the Perl version of an identical pattern, and I have no idea why — other than that the Perl regex compiler is about 100 times smarter at optimizations than the Java regex compiler, which doesn’t ever do any to speak of, and should.

Here’s the equivalent Java program. I’ve removed the leading anchor so that we can loop properly.

$ cat java.crap
import java.util.regex.*;

public class crap {

public static void
main(String[ ] argv) {
    String input = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";
    String regex = "\n"
                + "(?: [^'\"\\s~:/@\\#|^&\\[\\]()\\\\{}]    # alt 1: unquoted         \n"
                + "    [^\"\\s~:/@\\#|^&\\[\\]()\\\\{}] *                     \n"
                + "  | \" (?: [^\"]++ | \"\" )++ \"       # alt 2: double quoted   \n"
                + "  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   \n"
                + ")                                                          \n"
                ;
    System.out.printf("Matching ‹%s› =~ qr{%s}x\n\n", input, regex);

    Pattern regcomp = Pattern.compile(regex, Pattern.COMMENTS);
    Matcher regexec = regcomp.matcher(input);

    int count;
    for (count = 0; regexec.find(); count++) {
       System.out.printf("Found match: ‹%s›\n", regexec.group());
    }
    if (count == 0) {
        System.out.printf("Match failed.\n");
    }
  }
}

Run Code Online (Sandbox Code Playgroud)

When run, that produces this:

$ javac -encoding UTF-8 crap.java && java -Dfile.encoding=UTF-8 crap
Matching ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])› =~ qr{
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted         
    [^"\s~:/@\#|^&\[\]()\\{}] *                     
  | " (?: [^"]++ | "" )++ "       # alt 2: double quoted   
  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   
)                                                          
}x

Found match: ‹pão›
Found match: ‹de›
Found match: ‹açúcar›
Found match: ‹itaucard›
Found match: ‹mastercard›
Found match: ‹platinum›
Found match: ‹SUSTENTABILIDAD›

Run Code Online (Sandbox Code Playgroud)

As you can plainly see, pattern matching in Java has quite a lot to be said about it, absolutely none of which would pass the potty-mouth police. It’s simply a royal pain in the ass.

Answer 2

Ala*_*ore 6

我不得不承认这一点让我感到惊讶,但我在RegexBuddy中获得了同样的结果:它在经过一百万步之后就退出了.我知道关于灾难性回溯的警告倾向于关注嵌套量词,但根据我的经验,交替至少是危险的.事实上,如果我改变你的正则表达式的最后一部分:

'(?:[^']+|'')+'

Run Code Online (Sandbox Code Playgroud)

......对此:

'(?:[^']*(?:''[^']*)*)'

Run Code Online (Sandbox Code Playgroud)

......它只用了十一步就失败了.这是Friedl的"展开循环"技术的一个例子,他将其分解为:

opening normal * ( special normal * ) * closing
   '     [^']        ''     [^']           '

Run Code Online (Sandbox Code Playgroud)

嵌套的星星是安全的,只要:

special并且normal永远不能匹配相同的东西,
special 总是匹配至少一个字符,和
special 是原子的(必须只有一种方式匹配).

然后正则表达式将无法与最小的回溯匹配,并且在没有回溯的情况下成功.另一方面,交替版本几乎可以保证回溯,并且在不可能匹配的情况下,随着目标字符串的长度增加,它会快速失控.如果它在一些风格中没有过度回溯,那是因为它们具有专门用于解决这个问题的优化 - 到目前为止很少有味道.

归档时间：	14 年，4 月前
查看次数：	1096 次
最近记录：	14 年，4 月前