Perl的正则表达式实现如何利用尝试?

sir*_*usa 5 regex optimization perl trie

我正在探索优化基于堆栈的回溯正则表达式实现的方法(另请参见本主题)。在评论中给出了有关Perl的正则表达式匹配器的提示。从源代码中,我已经发现,在将正则表达式与几种替代方案进行匹配时,它会使用尝试。

例如,像/(?:foo|bar|baz|foobar)$/这样的正则表达式通常会产生类似的程序

BRANCH
  EXACT <foo>
BRANCH
  EXACT <bar>
BRANCH
  EXACT <baz>
BRANCH
  EXACT <foobar>
EOL
Run Code Online (Sandbox Code Playgroud)

而使用特里的优化版本看起来像

TRIE-EXACT
  <foo>
  <bar>
  <baz>
  <foobar>
EOL
Run Code Online (Sandbox Code Playgroud)

带有EXACT命令的四个分支分别优化为一个TRIE命令。

但是,对于回溯,我不理解在执行阶段如何使用该Trie。

考虑未优化的代码和输入字符串foobar。当试图匹配正则表达式,第一个分支foo会成功匹配时,$会失败,接下来的树枝barbaz会失败,最终分支foobar将匹配,$将匹配,并成功匹配完成。

现在,匹配如何与Trie一起工作?根据我的理解,特里只有一种合理的隐含顺序,其中尝试重叠的条目,即最长的匹配优先。在上述示例中,这将给出正确的匹配。

但这会给/(?:foo|foobar)barr$/和输入错误的结果foobarr。特里将匹配foobar,但在下一个失败,r因为它期望b下一个。因此,不仅必须有一种在Trie中回溯的方法,而且还必须以某种方式保留分支的原始顺序,否则优化的程序在语义上将与非优化的版本不同。

Perl实现如何处理这种基于树的匹配?

dem*_*phq 5

I wrote the original implementation of this code. It has changed somewhat since then, which has slightly changed the time/space tradeoffs for how it works.

First, I need to call out one misunderstanding in your post, the rules for Perls regex engine are not to match the longest string, but rather to match the leftmost-longest string, in particular

"foal"=~/(f|fo|foa|foal)/ 
Run Code Online (Sandbox Code Playgroud)

should match "f" not "foal" as the "f" option is first. This is called the "word order" in the comments in the code. You can see this with Perl by injecting (?{ print $& }) into your patterns:

perl -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
f
foa
fo 
Run Code Online (Sandbox Code Playgroud)

Another point I thought I should explain is that the code and comments use the term "state" and "accepting state" a lot. This comes from the observation that a Trie is equivalent to an acyclic DFA, and can be naively represented as a table of one row per state, with one column per legal character in the input alphabet and with an additional column being used to track whether the state is "accepting", meaning "ends one of the alternatives", and if so which branch of the alternation it terminates. The values in the other columns represent the new state to transition to after reading the columns character. An accepting state may have transitions to more states, as in the case where one branch matches a prefix of another.

Matching then comes down to walking the tree/state table, when we encounter an accepting state we push a 2-tuple of the number of characters we have read so far in the trie and the word number associated with the state into a list. Eventually we either run out of string to read, or encounter a transition to the 0 state which means we can stop as no further matches are possible.

Once you have a list of the tuples of (length,word) you can then go through them in order of word ascending, since such lists are generally short IMO it is reasonable to simply use linear probing to find the minimum word each try. Perl uses a strategy where it precompiles all such possible lists into one structure in advance, so it need only keep track of the branch number of the most recent accepting state, and then can (somewhat expensively) compute any preceeding accepting states from that. For instance the example below /(f|foa|fo)al/ produces a structure like this:

word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
Run Code Online (Sandbox Code Playgroud)

Which shows that if we reach the accepting state for "foa" which is branch 2 we have matched a 3 character string and there is a preceding accepting state of branch 3, which in turn points at 1. From this we can reconstruct the behavior we would expect to see, which is to try "al" after "f", "foa" and then "fo". The process is somewhat expensive (N**2) compared to other solutions, but has the advantage of being precompiled and reusable and the storage requirements fit nicely with how the regex backtracking engine works. IMO it doesnt matter, the main issue is that somehow you track the accepting states where you were in the string when you encountered them, and then try the "tail" in the right order.

You can see much of this at work with extended debugging:

perl -Mre=Debug,TRIE,TRIEE,TRIEM -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/' 
Compiling REx "(?:f|foa|fo)(?{print $&})al"
  Looking for TRIE'able sequences. Tail node is: EVAL
  - BRANCH (1) -> EXACT <f> => EVAL (First==-1,Last==-1,Cur==1,tt==END,nt==EXACT,nnt==END)
  - BRANCH (4) -> EXACT <foa>   => EVAL (First==1,Last==-1,Cur==4,tt==EXACT,nt==EXACT,nnt==END)
  - BRANCH (7) -> EXACT <fo>    => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END)
  - TAIL (10) <SCAN FINISHED>
    make_trie start==1, first==1, last==10, tail==11 depth=1
    TRIE(NATIVE): W:3 C:6 Uq:3 Min:1 Max:3
    Compiling trie using table compiler
      Char :    f   o   a
      State+-------------
         1 :    2   .   . (   1)
         2 :    .   3   . (   1) W   1
         3 :    .   .   4 (   1) W   3
         4 :    .   .   . (   0) W   2
Run Code Online (Sandbox Code Playgroud)

Here you can see the DFA/TRIE equivalency and the word data, and how the table is constructed. Because there are only 3 legal characters we build a table that has 4 columns, one per each legal character, and one to track if the state is accepting and if so which branch of the alternation it accepts. The W 2 on row 4 shows that it is accepting for the 2nd branch of the alternation.

    Alloc: 22 Orig: 13 elements, Final:3. Savings of %76.92
    Statecount:5 Lasttrans:4
      Char : Match Base  Ofs     f   o   a
      State|-----------------------------------
      #   1|       @   3 + 0[    2   .   .]
      #   2| W   1 @   3 + 1[    .   3   .]
      #   3| W   3 @   3 + 2[    .   .   4]
      #   4| W   2 @   0 
Run Code Online (Sandbox Code Playgroud)

This shows the naive table representation being packed into a compressed form that allows the zero transition in one row to be used to store non-zero transitions in other rows. Each transition is stored in an array of 2-tuple of [owner-state,transition], with two side arrays, one which stores the start position of a given state, and another that stores the accepting mappings for a given state, state transitions involve checking if table[start_offset[state]+column_ofs].owner_state is the same as state, and if not then table[start_offset[state]+column_ofs].new_state can be assumed to be 0. See the Red Dragon for more details.


    word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
Run Code Online (Sandbox Code Playgroud)

this part of the output shows how we have precomputed the length, and the predecessor branch number for every accepting state. This way we don't have to construct a list of accepting states as we match, but instead can precompute all the possible such lists at compile time into one static structure.

Final program:
   1: EXACT <f> (3)
   3: TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o] (11)
      <> 
      <oa> 
      <o> 
Run Code Online (Sandbox Code Playgroud)

Here you can see an interesting optimization kick in, because all of the alternations start with the letter "f" we "remove" it from the trie, and turn it into a prefix, the EXACT , followed by a trie containing the empty string, the letters "oa" and the letter "o". The S:2/4 means we start on state 2 not on the normal state 1, which is the state that corresponds to the letter 'f'. Extracting the 'f' like this allows it to be used elsewhere in the matching process, which means we can use less expensive machinery the Trie to find possible matches.

  11: EVAL (13)
  13: EXACT <al> (15)
  15: END (0)
anchored "f" at 0 floating "al" at 1..3 (checking floating) minlen 3 with eval 
Guessing start of match in sv for REx "(?:f|foa|fo)(?{print $&})al" against "foal"
Found floating substr "al" at offset 2...
Found anchored substr "f" at offset 0...
Run Code Online (Sandbox Code Playgroud)

The 'f' that was extracted earlier is being used to find the first possible place in the string the pattern could match.

Guessed: match at offset 0
Matching REx "(?:f|foa|fo)(?{print $&})al" against "foal"
   0 <> <foal>               |  1:EXACT <f>(3)
   1 <f> <oal>               |  3:TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o](11)
   1 <f> <oal>               |    State:    2 Accepted: Y Charid:  2 CP:  6f After State:    3
   2 <fo> <al>               |    State:    3 Accepted: Y Charid:  3 CP:  61 After State:    4
   3 <foa> <l>               |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
Run Code Online (Sandbox Code Playgroud)

Here we can see the walk through tree starting at state 2. Each "Accepted" line means we have found a possible match. The last line shows us reading a character not in our alphabet, meaning a transition to state 0 which means we can stop. Unfortunately this does not show the word numbers associated with each match, but you can infer them from the state transition table: 1,3,2, but we need to "try" them 1,2,3.

                                  got 3 possible matches
                                  TRIE matched word #1, continuing
   1 <f> <oal>               | 11:  EVAL(13)
f
   1 <f> <oal>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #2, continuing
   3 <foa> <l>               | 11:  EVAL(13)
foa
   3 <foa> <l>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <o>
   2 <fo> <al>               | 11:EVAL(13)
fo
   2 <fo> <al>               | 13:EXACT <al>(15)
   4 <foal> <>               | 15:END(0)
Match successful!

Run Code Online (Sandbox Code Playgroud)

Note an explanation of different aspects of the Trie compilation process can be found at:

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l2407

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l4726

  • 感谢您的详细回答! (2认同)