zx8*_*x81 25 php regex recursion
编辑:我选择了ridgerunner的答案,因为它包含解决问题所需的信息.但我也想为特定问题添加一个完全充实的解决方案,以防其他人想要完全理解这个例子.你会发现它在下面的某个地方.
这个问题是关于澄清php的正则表达式引擎的递归表达式的行为.(如果你想法如何在不使用递归的php正则表达式的情况下正确匹配下面的字符串,这非常酷,但这不是问题.)
a(?:(?R)|a?)a
Run Code Online (Sandbox Code Playgroud)
这是一个简单的表达式,旨在匹配字符"a"或没有任何内容,嵌套在字符"a"的一个或多个嵌套中.例如,aa,aaa,aaaa,aaaaa.您不需要为此使用递归:
aa*a
Run Code Online (Sandbox Code Playgroud)
会很棒.但重点是使用递归.
以下是您可以运行的一段代码来测试我的失败模式:
<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>
Run Code Online (Sandbox Code Playgroud)
在该模式中,两个"a"构成交替.在交替中,我们要么匹配整个模式的递归(两个"a"构成交替),要么匹配字符"a",可选地为空.
在我看来,对于"aaaa",这应该与"aaaa"相匹配.
但这是输出:
a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa
Run Code Online (Sandbox Code Playgroud)
有人能解释第三和第五行输出的情况吗?我试过追踪我想象引擎必须采取的路径,但我必须想象它是错的.为什么引擎返回"aaa"作为"aaaa"的匹配?是什么让它如此渴望?我必须以错误的顺序想象匹配的树.
我意识到了
#(?:a|a(?R)a)*#
Run Code Online (Sandbox Code Playgroud)
有点作品,但我的问题是为什么其他模式没有.
谢谢堆!
rid*_*ner 13
优秀(和困难)的问题!
首先,使用PCRE正则表达式引擎,其(?R)行为类似于原子组(与Perl不同).一旦匹配(或不匹配),递归调用内发生的匹配就是最终的(并且丢弃在递归调用中保存的所有回溯面包屑).但是,正则表达式引擎确实保存了整个(?R)表达式匹配的内容,并且可以将其返回并尝试另一种替代方法来实现整体匹配.为了描述正在发生的事情,让我们稍微改变你的例子,这样就可以更容易地谈论并跟踪每一步的匹配内容.而不是:aaaa作为主题文本,让我们使用:abcd.然后让我们将正则表达式'#a(?:(?R)|a?)a#'改为:'#.(?:(?R)|.?).#'.正则表达式引擎匹配行为是相同的.
/.(?:(?R)|.?)./to:"abcd"answer = r'''
Step Depth Regex Subject Comment
1 0 .(?:(?R)|.?). abcd Dot matches "a". Advance pointers.
^ ^
2 0 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 1).
^ ^
3 1 .(?:(?R)|.?). abcd Dot matches "b". Advance pointers.
^ ^
4 1 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 2).
^ ^
5 2 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers.
^ ^
6 2 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 3).
^ ^
7 3 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers.
^ ^
8 3 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 4).
^ ^
9 4 .(?:(?R)|.?). abcd Dot fails to match end of string.
^ ^ DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
Give back text consumed by depth 4 (?R) = ""
10 3 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches EOS.
^ ^ Advance regex pointer.
11 3 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ DEPTH 3 (?R) FAILS. Return to step 6 depth 2
Give back text consumed by depth3 (?R) = "d"
12 2 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "d".
^ ^ Advance pointers.
13 2 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to step 12 depth 2
14 2 .(?:(?R)|.?). abcd Match zero "d" (give it back).
^ ^ Advance regex pointer.
15 2 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers.
^ ^ DEPTH 2 (?R) SUCCEEDS.
Return to step 4 depth 1
16 1 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to try other alternative. Give back
text consumed by depth 2 (?R) = "cd"
17 1 .(?:(?R)|.?). abcd Optional dot matches "c". Advance pointers.
^ ^
18 1 .(?:(?R)|.?). abcd Required dot matches "d". Advance pointers.
^ ^ DEPTH 1 (?R) SUCCEEDS.
Return to step 2 depth 0
19 0 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to try other alternative. Give back
text consumed by depth 1 (?R) = "bcd"
20 0 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "b".
^ ^ Advance pointers.
21 0 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers.
^ ^ SUCCESSFUL MATCH of "abc"
'''
Run Code Online (Sandbox Code Playgroud)
正则表达式引擎没有任何问题.正确的匹配是abc(或者aaa对于原始问题.)对于所讨论的另一个较长的结果字符串,可以进行类似的(尽管更长)步骤序列.
Wis*_*guy 12
重要说明:这描述了PHP中的递归正则表达式(使用PCRE库).递归正则表达式在Perl中的工作方式略有不同.
注意:这可以按照您可以概念化的顺序进行说明.正则表达式引擎向后执行此操作; 它潜入基础案例并回归.
由于你的外部as明确地存在,它将匹配a两个as 之间的匹配,或者两个as 之间的整个模式的先前递归匹配.结果,它只匹配奇数个as(中间一个加上两个的倍数).
在三个长度上,aaa是当前递归的匹配模式,因此在第四次递归时,它正在寻找a两个as(即,aaa)之间或之前递归的两个as 之间的匹配模式(即a+ aaa+ a).显然,a当字符串不长时,它不能匹配5 秒,所以它可以做的最长匹配是3.
类似处理长度为6,因为它只能匹配aaa由as 包围的"默认" 或前一个递归的匹配(即a+ aaaaa+ a).
但是,它并不能满足所有奇数长度.
由于您是递归匹配的,因此您只能匹配文字aaa或a+(prev recurs match)+ a.因此,每次连续比赛总是a比前一场比赛长两秒,否则它将会缩减并回落aaa.
在长度为7(匹配aaaaaaa)时,前一次递归的匹配是后备aaa.所以这一次,即使有七个as,它也只会匹配三个(aaa)或五个(a+ aaa+ a).
循环到更长的长度(在此示例中为80)时,请查看模式(仅显示匹配,而不是输入):
no match
aa
aaa
aaa
aaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
Run Code Online (Sandbox Code Playgroud)
这里发生了什么?好吧,我会告诉你的!:-)
当递归匹配比输入字符串长一个字符时,它就会回归aaa,正如我们所见.在此之后的每次迭代中,模式开始匹配比前一个匹配多两个字符.每次迭代,输入的长度都会增加1,但匹配的长度会增加2.当匹配大小最终捕获并超过输入字符串的长度时,它会回到aaa.等等.
或者查看,在这里我们可以看到输入与每次迭代中的匹配长度相比有多少个字符:
(input len.) - (match len.) = (difference)
1 - 0 = 1
2 - 2 = 0
3 - 3 = 0
4 - 3 = 1
5 - 5 = 0
6 - 3 = 3
7 - 5 = 2
8 - 7 = 1
9 - 9 = 0
10 - 3 = 7
11 - 5 = 6
12 - 7 = 5
13 - 9 = 4
14 - 11 = 3
15 - 13 = 2
16 - 15 = 1
17 - 17 = 0
18 - 3 = 15
19 - 5 = 14
20 - 7 = 13
21 - 9 = 12
22 - 11 = 11
23 - 13 = 10
24 - 15 = 9
25 - 17 = 8
26 - 19 = 7
27 - 21 = 6
28 - 23 = 5
29 - 25 = 4
30 - 27 = 3
31 - 29 = 2
32 - 31 = 1
33 - 33 = 0
34 - 3 = 31
35 - 5 = 30
36 - 7 = 29
37 - 9 = 28
38 - 11 = 27
39 - 13 = 26
40 - 15 = 25
41 - 17 = 24
42 - 19 = 23
43 - 21 = 22
44 - 23 = 21
45 - 25 = 20
46 - 27 = 19
47 - 29 = 18
48 - 31 = 17
49 - 33 = 16
50 - 35 = 15
51 - 37 = 14
52 - 39 = 13
53 - 41 = 12
54 - 43 = 11
55 - 45 = 10
56 - 47 = 9
57 - 49 = 8
58 - 51 = 7
59 - 53 = 6
60 - 55 = 5
61 - 57 = 4
62 - 59 = 3
63 - 61 = 2
64 - 63 = 1
65 - 65 = 0
66 - 3 = 63
67 - 5 = 62
68 - 7 = 61
69 - 9 = 60
70 - 11 = 59
71 - 13 = 58
72 - 15 = 57
73 - 17 = 56
74 - 19 = 55
75 - 21 = 54
76 - 23 = 53
77 - 25 = 52
78 - 27 = 51
79 - 29 = 50
80 - 31 = 49
Run Code Online (Sandbox Code Playgroud)
由于现在应该有意义的原因,这发生在2的倍数.
我略微简化了这个例子的原始模式.记住这一点.我们将回到它.
a((?R)|a)a
Run Code Online (Sandbox Code Playgroud)
作者杰弗里弗里尔所说的" (R)构造对整个正则表达式的递归引用 "是正则表达式引擎将代替整个模式代替(?R)尽可能多的次数.
a((?R)|a)a # this
a((a((?R)|a)a)|a)a # becomes this
a((a((a((?R)|a)a)|a)a)|a)a # becomes this
# and so on...
Run Code Online (Sandbox Code Playgroud)
手动追踪时,您可以从内到外进行操作.在(?R)|a,a是你的基础案例.所以我们将从那开始.
a(a)a
Run Code Online (Sandbox Code Playgroud)
如果匹配输入字符串,请将match(aaa)恢复为原始表达式并将其替换为(?R).
a(aaa|a)a
Run Code Online (Sandbox Code Playgroud)
如果输入字符串与我们的递归值匹配,aaaaa则将match()替换回原始表达式以再次递归.
a(aaaaa|a)a
Run Code Online (Sandbox Code Playgroud)
重复,直到使用上一次递归的结果无法匹配您的输入.
示例
输入:正则aaaaaa
表达式:a((?R)|a)a
从基础案例开始aaa.
输入是否与此值匹配?是:aaa
通过放入aaa原始表达式来递归:
a(aaa|a)a
Run Code Online (Sandbox Code Playgroud)
输入是否与我们的递归值匹配?是:aaaaa
通过放入aaaaa原始表达式来递归:
a(aaaaa|a)a
Run Code Online (Sandbox Code Playgroud)
输入是否与我们的递归值匹配?没有:aaaaaaa
然后我们就到这儿了.上面的表达式可以重写(为简单起见):
aaaaaaa|aaa
Run Code Online (Sandbox Code Playgroud)
由于它不匹配aaaaaaa,它必须匹配aaa.我们完成了,aaa是最后的结果.