Mar*_*der 19 .net regex regex-lookarounds
Matching repeated characters in a regex is simple with a backreference:
(.)\1
However, I would like to match the character after this pair of characters, so I thought I could simply put this in a lookbehind:
(?<=(.)\1).
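However, this does not match anything. Why is that? In other flavours I wouldn't be surprised, because there are strong restrictions on lookbehinds, but .NET usually supports arbitrarily complicated patterns inside lookbehinds.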
Mar*_*der 23
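Short version: Lookbehinds are matched from right to left. That means when the regex engine encounters the \1 it hasn't captured anything into that group yet, so the regex always fails. The solution is quite simple: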
(?<=\1(.)).
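Unfortunately, the full story is a lot more subtle once you start using more complex patterns. So here is...
First, some important acknowledgements. The person who taught me that lookbehinds are matched from right to left (and who figured this out on his own through a lot of experimentation) was Kobi, in this answer. Unfortunately, the question I asked back then was a very convoluted example which doesn't make for a great reference for such a simple problem. So we figured it would make sense to write a new and more canonical post for future reference and as a suitable dupe target. But please do consider giving Kobi an upvote for figuring out a very important aspect of .NET's regex engine that is virtually undocumented (as far as I know, MSDN mentions it in a single sentence on a non-obvious page).
Note that rexegg.com explains the inner workings of .NET's lookbehinds differently (in terms of reversing the string, the regex, and any potential captures). Although that makes no difference to the result of the match, I find that approach much harder to reason about, and from looking at the code it is fairly clear that this is not what the implementation actually does.
So. The first question is: why is it actually more subtle than the short version above? Let's try matching a character that is preceded by either a or A using a local case-insensitive modifier. Given the right-to-left matching behaviour, one might expect this to work: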
(?<=a(?i)).
However, the modifier doesn't seem to be applied at all. Indeed, it does work if we put the modifier in front:
(?<=(?i)a).
Another example, which might be surprising with right-to-left matching in mind, is the following:
(?<=\2(.)(.)).
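Does the \2 refer to the left or the right capturing group? It refers to the right one.
A final example: when matched against abc, does this capture b or ab?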
(?<=(b|a.))c
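It captures b. Once again, "lookbehinds are applied from right to left" isn't the full story.
Hence, this post tries to be a comprehensive reference on all things regarding the directionality of regex in .NET, as I'm not aware of any such resource. The trick to reading a complicated regex in .NET is to do so in three or four passes. All but the last pass are left-to-right, regardless of lookbehinds or RegexOptions.RightToLeft. I believe this is the case because .NET processes these things when parsing and compiling the regex.
The first pass deals with inline modifiers. This is basically what the modifier example above shows. If anywhere in your regex you have this snippet: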
...a(b(?i)c)d...
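Regardless of where in the pattern that is, or whether you're using the RTL option, c will be case-insensitive while a, b and d will not (provided they aren't affected by some other preceding or global modifier). This is probably the simplest rule.
The second pass assigns numbers to unnamed groups. For this pass, you should completely ignore any named groups in the pattern, i.e. those of the form (?<a>...). Note that this does not include groups with explicit numbers like (?<2>...) (which are a thing in .NET).
Capturing groups are numbered from left to right. It doesn't matter how complicated your regex is, whether you're using the RTL option, or whether you nest dozens of lookbehinds and lookaheads. When you're only using unnamed capturing groups, they are numbered from left to right depending on the position of their opening parenthesis. An example: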
(a)(?<=(b)(?=(.)).((c).(d)))(e)
└1┘    └2┘   └3┘  │└5┘ └6┘│ └7┘
                  └───4───┘
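This gets a bit trickier when mixing unlabelled groups with explicitly numbered groups. You should still read all of these from left to right, but the rules are a bit more involved. You can determine the number of a group as follows: an explicitly numbered group gets exactly the number it specifies, and an unnamed group takes the first number that is not yet used by a group to its left.
Gaps are allowed: (?<1>.)(?<5>.) is a perfectly valid regex with group numbers 2 to 4 unused. Here is an example (without nesting, for simplicity; remember to order groups by their opening parentheses when they are nested):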
(a)(?<1>b)(?<2>c)(d)(e)(?<6>f)(g)(h)
└1┘└──1──┘└──2──┘└3┘└4┘└──6──┘└5┘└7┘
Notice how the explicit group 6 creates a gap, then the group capturing g takes that unused gap between groups 4 and 6, whereas the group capturing h takes 7 because 6 is already used. Remember that there might be named groups anywhere in between these, which we're completely ignoring for now.
If you're wondering what the purpose of repeated groups like group 1 in this example is, you might want to read about balancing groups.
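The third pass assigns numbers to named groups. Of course, you can skip this pass entirely if there are no named groups in the regex.
It's a little known feature that named groups also have (implicit) group numbers in .NET, which can be used in backreferences and substitution patterns for Regex.Replace. These get their numbers in a separate pass, once all the unnamed groups have been processed. The rules for giving them numbers are as follows: named groups are processed from left to right, each new name takes the first number that is still unused, and groups that share a name also share its number.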
A more complete example with all three types of groups, explicitly showing passes two and three:
        (?<a>.)(.)(.)(?<b>.)(?<a>.)(?<5>.)(.)(?<c>.)
Pass 2: │     │└1┘└2┘│     ││     │└──5──┘└3┘│     │
Pass 3: └──4──┘      └──6──┘└──4──┘          └──7──┘
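The numbering passes can be sketched in a few lines of Python. This is only an illustration of the rule as it can be read off the worked examples in this answer ("take the first unused number"); it is not taken from .NET's source or documentation. The names GROUP_OPEN, first_unused and number_groups are made up for this sketch, and the scanner deliberately ignores escaped parentheses, character classes, balancing groups and the (?'name'...) syntax.

```python
import re

# Classifies every group opening in a pattern (simplified, see assumptions above):
#   group(1) is the name or explicit number of a (?<...> group,
#   group(2) is "?" for any other (?... construct (lookarounds, (?:...), ...),
#   both are None for a plain unnamed group.
GROUP_OPEN = re.compile(r"\((?:\?<([A-Za-z_]\w*|\d+)>|(\?))?")


def first_unused(used):
    """Smallest positive group number that has not been handed out yet."""
    n = 1
    while n in used:
        n += 1
    return n


def number_groups(pattern):
    """Return the group numbers of all capturing groups, in pattern order."""
    opens = list(GROUP_OPEN.finditer(pattern))
    numbers = [None] * len(opens)
    used = set()

    # Pass 2: unnamed and explicitly numbered groups, left to right.
    for i, m in enumerate(opens):
        label, other = m.group(1), m.group(2)
        if other is not None:
            continue                      # lookaround or other non-capturing group
        if label is None:
            numbers[i] = first_unused(used)
        elif label.isdigit():
            numbers[i] = int(label)
        else:
            continue                      # named group: handled in pass 3
        used.add(numbers[i])

    # Pass 3: named groups, left to right; equal names share one number.
    by_name = {}
    for i, m in enumerate(opens):
        label = m.group(1)
        if label is not None and not label.isdigit():
            if label not in by_name:
                by_name[label] = first_unused(used)
                used.add(by_name[label])
            numbers[i] = by_name[label]

    return [n for n in numbers if n is not None]
```

Calling number_groups on the three example patterns of this section yields one number per capturing group, in the order of their opening parentheses, which can be compared against the diagrams.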
Now that we know which modifiers apply to which tokens and which groups have which numbers, we finally get to the part that actually corresponds to the execution of the regex engine, and where we start going back and forth.
.NET's regex engine can process regex and string in two directions: the usual left-to-right mode (LTR) and its unique right-to-left mode (RTL). You can activate RTL mode for the entire regex with RegexOptions.RightToLeft. In that case, the engine will start trying to find a match at the end of the string and will go left through the regex and the string. For example, the simple regex
a.*b
would first match a b, then try to match .* to the left of that (backtracking as necessary) such that there's an a somewhere to the left of it. Of course, in this simple example, the result in LTR and RTL mode is identical, but it helps to make a conscious effort to follow the engine in its backtracking. It can make a difference for something as simple as ungreedy quantifiers. Consider the regex
a.*?b
instead. We're trying to match axxbxxb. In LTR mode, you get the match axxb as expected, because the ungreedy quantifier is satisfied with the xx. However, in RTL mode, you'd actually match the entire string, since the first b is found at the end of the string, but then .*? needs to match all of xxbxx for a to match.
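To make the difference concrete, here is a small Python sketch. Python's re module has no right-to-left mode, so the RTL half is a hand-written simulation of the steps described above for the single pattern a.*?b, not a call into any real RTL engine. The function name lazy_rtl_match is made up for this illustration.

```python
import re


def lazy_rtl_match(text):
    """Hand-simulate `a.*?b` matched from right to left."""
    # An RTL engine tries end positions starting from the end of the string.
    for end in range(len(text), 0, -1):
        if text[end - 1] != "b":       # in RTL mode the `b` is matched first
            continue
        pos = end - 1                  # position just left of the matched `b`
        while pos > 0:
            if text[pos - 1] == "a":   # lazy: stop at the first `a` we reach
                return text[pos - 1:end]
            if text[pos - 1] == "\n":  # `.` does not match a newline by default
                break
            pos -= 1                   # otherwise `.*?` takes one more character
    return None


# Ordinary left-to-right match with Python's own engine.
ltr = re.search(r"a.*?b", "axxbxxb").group()
# Simulated right-to-left match of the same pattern.
rtl = lazy_rtl_match("axxbxxb")
```

Here ltr is the short match and rtl is the entire string, because the lazy quantifier can only stop expanding once it reaches an a, and the only a is at the very start.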
And clearly it also makes a difference for backreferences, as the example in the question and at the top of this answer shows. In LTR mode we use (.)\1 to match repeated characters and in RTL mode we use \1(.), since we need to make sure that the regex engine encounters the capture before it tries to reference it.
With that in mind, we can view lookarounds in a new light. When the regex engine encounters a lookbehind, it processes it as follows:
1. It remembers its current position x in the target string as well as its current processing direction.
2. It switches to RTL mode and matches the contents of the lookbehind from right to left, starting at position x.
3. Once the lookbehind has been processed, the engine is reset to position x and the original processing direction is restored.
While a lookahead seems a lot more innocuous (since we almost never encounter problems like the one in the question with them), its behaviour is actually virtually the same, except that it enforces LTR mode. Of course, in most patterns, which are LTR only, this is never noticed. But if the regex itself is matched in RTL mode, or we're doing something as crazy as putting a lookahead inside a lookbehind, then the lookahead will change the processing direction just like the lookbehind does.
So how should you actually read a regex that does funny stuff like this? The first step is to split it into separate components, which are usually individual tokens together with their relevant quantifiers. Then, depending on whether the regex is LTR or RTL, start going from top to bottom or bottom to top, respectively. Whenever you encounter a lookaround in the process, check which way it's facing, skip to the correct end and read the lookaround from there. When you're done with the lookaround, continue with the surrounding pattern.
Of course there's another catch... when you encounter an alternation (..|..|..), the alternatives are always tried from left to right, even during RTL matching. Of course, within each alternative, the engine proceeds from right to left.
Here is a somewhat contrived example to show this:
.+(?=.(?<=a.+).).(?<=.(?<=b.|c.)..(?=d.|.+(?<=ab*?))).
And here is how we can split this up. The numbers on the left show the reading order if the regex is in LTR mode. The numbers on the right show the reading order in RTL mode:
LTR             RTL
 1  .+          18
    (?=
 2    .         14
      (?<=
 4      a       16
 3      .+      15
      )
 5    .         17
    )
 6  .           13
    (?<=
17    .         12
      (?<=
14      b        9
13      .        8
      |
16      c       11
15      .       10
      )
12    ..         7
      (?=
 7      d        2
 8      .        3
      |
 9      .+       4
        (?<=
11        a      6
10        b*?    5
        )
      )
    )
18  .            1
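The reading rules can also be written down as a short recursive function. The following Python sketch is only an illustration: the pattern is entered by hand as a nested structure instead of being parsed, the names PATTERN and reading_order are made up, and alternation is only modelled inside lookarounds. A plain string is one component, and a tuple ("ahead", [...]) or ("behind", [...]) is a lookaround holding a list of alternatives.

```python
# The contrived example, split into components by hand.
PATTERN = [
    ".+",
    ("ahead", [[".", ("behind", [["a", ".+"]]), "."]]),
    ".",
    ("behind", [[
        ".",
        ("behind", [["b", "."], ["c", "."]]),
        "..",
        ("ahead", [["d", "."], [".+", ("behind", [["a", "b*?"]])]]),
    ]]),
    ".",
]


def reading_order(nodes, rtl=False):
    """List the components in the order the engine would visit them."""
    order = []
    # In RTL mode a sequence is walked from its last component to its first.
    for node in (reversed(nodes) if rtl else nodes):
        if isinstance(node, str):
            order.append(node)
            continue
        kind, alternatives = node
        # A lookbehind forces RTL and a lookahead forces LTR,
        # no matter which direction we arrived with.
        inner_rtl = (kind == "behind")
        # Alternatives are always tried from left to right.
        for alternative in alternatives:
            order.extend(reading_order(alternative, inner_rtl))
    return order


ltr_order = reading_order(PATTERN, rtl=False)
rtl_order = reading_order(PATTERN, rtl=True)
```

Each of the two resulting lists has 18 entries, one per component. The direction passed in only affects the surrounding sequence, since every lookaround picks its own direction for its contents.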
I sincerely hope that you'll never use something as crazy as this in production code, but maybe one day a friendly colleague will leave some crazy write-only regex in your company's code base before being fired, and on that day I hope that this guide might help you figure out what the hell is going on.
For the sake of completeness, this section explains how balancing groups are affected by the directionality of the regex engine. If you don't know what balancing groups are, you can safely ignore this. If you want to know what balancing groups are, I've written about it here, and this section assumes that you know at least that much about them.
There are three types of group syntax that are relevant for balancing groups.
1. Explicitly named or numbered groups like (?<a>...) or (?<2>...) (or even implicitly numbered groups), which we've dealt with above.
2. Groups that pop from one of the capture stacks, like (?<-a>...) and (?<-2>...). These behave as you'd expect them to. When they're encountered (in the correct processing order described above), they simply pop from the corresponding capture stack. It might be worth noting that these don't get implicit group numbers.
3. The "proper" balancing groups (?<b-a>...), which are usually used to capture the string since the last capture of a and push it onto b. Their behaviour gets weird when mixed with right-to-left mode, and that's what this section is about.
The takeaway is that the (?<b-a>...) feature is effectively unusable with right-to-left mode. However, after a lot of experimentation, the (weird) behaviour actually appears to follow some rules, which I'm outlining here.
First, let's look at an example which shows why lookarounds complicate the situation. We're matching the string abcde...vwxyz. Consider the following regex:
(?<a>fgh).{8}(?<=(?<b-a>.{3}).{2})
Reading the regex in the order I presented above, we can see that:
1. The regex captures fgh into group a.
2. .{8} then moves eight characters to the right, consuming ijklmnop.
3. The lookbehind switches to RTL mode, and .{2} moves two characters to the left.
4. Finally, (?<b-a>.{3}) is the balancing group which pops the capture off group a and pushes something onto group b. In this case, the group matches lmn and we push ijk onto group b as expected.
However, it should be clear from this example that by changing the numerical parameters, we can change the relative position of the substrings matched by the two groups. We can even make those substrings intersect, or have one contained completely inside the other, by making the 3 smaller or larger. In this case it's no longer clear what it means to push everything between the two matched substrings.
It turns out that there are three cases to distinguish.
Case 1: (?<a>...) matches left of (?<b-a>...)
This is the normal case. The top capture is popped from a and everything between the substrings matched by the two groups is pushed onto b. Consider the following two substrings for the two groups:
abcdefghijklmnopqrstuvwxyz
   └──<a>──┘  └──<b-a>──┘
Which you might get with the regex
(?<a>d.{8}).+$(?<=(?<b-a>.{11}).)
Then mn would be pushed onto b.
Case 2: (?<a>...) and (?<b-a>...) intersect
This includes the case where the two substrings touch but don't contain any common characters (only a common boundary between characters). This can happen if one of the groups is inside a lookaround and the other one is not, or is inside a different lookaround. In this case the intersection of both substrings will be pushed onto b. This is still true when one substring is completely contained inside the other.
Here are several examples to show this:
Example:                     Pushes onto <b>:   Possible regex:

abcdefghijklmnopqrstuvwxyz   ""                 (?<a>d.{8}).+$(?<=(?<b-a>.{11})...)
   └──<a>──┘└──<b-a>──┘

abcdefghijklmnopqrstuvwxyz   "jkl"              (?<a>d.{8}).+$(?<=(?<b-a>.{11}).{6})
   └──<a>──┘       │
         └──<b-a>──┘

abcdefghijklmnopqrstuvwxyz   "klmnopq"          (?<a>k.{8})(?<=(?<b-a>.{11})..)
      │   └──<a>──┘
      └──<b-a>──┘

abcdefghijklmnopqrstuvwxyz   ""                 (?<=(?<b-a>.{7})(?<a>.{4}o))
   └<b-a>┘└<a>┘

abcdefghijklmnopqrstuvwxyz   "fghijklmn"        (?<a>d.{12})(?<=(?<b-a>.{9})..)
   └────<a>────┘
     └─<b-a>─┘

abcdefghijklmnopqrstuvwxyz   "cdefg"            (?<a>c.{4})..(?<=(?<b-a>.{9}))
│ └<a>┘ │
└─<b-a>─┘
Case 3: (?<a>...) matches right of (?<b-a>...)
This case I don't really understand and would consider a bug: when the substring matched by (?<b-a>...) is properly left of the substring matched by (?<a>...) (with at least one character between them, such that they don't share a common boundary), nothing is pushed onto b. By that I really mean nothing, not even an empty string: the capture stack itself remains empty. However, matching the group still succeeds, and the corresponding capture is popped off the a group.
What's particularly annoying about this is that this case would likely be a lot more common than case 2, since this is what happens if you try to use balancing groups the way they were meant to be used, but in a plain right-to-left regex.
Update on case 3: After some more testing done by Kobi it turns out that something happens on stack b. It appears that nothing is pushed, because m.Groups["b"].Success will be False and m.Groups["b"].Captures.Count will be 0. However, within the regex, the conditional (?(b)true|false) will now use the true branch. Also in .NET it seems to be possible to do (?<-b>) afterwards (after which accessing m.Groups["b"] will throw an exception), whereas Mono throws an exception immediately while matching the regex. Bug indeed.
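The three cases can be summarised as a small function over the two matched spans. This Python sketch only restates the behaviour observed above; it does not run a .NET regex. The names ALPHABET and pushed_onto_b are made up, spans are (start, end) pairs with an exclusive end, and returning None for case 3 is a simplification that ignores the odd internal stack state mentioned in the update.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def pushed_onto_b(text, a_span, ba_span):
    """Substring that (?<b-a>...) pushes onto b, or None if nothing is pushed."""
    a_start, a_end = a_span        # span matched by (?<a>...)
    ba_start, ba_end = ba_span     # span matched by (?<b-a>...)
    if a_end < ba_start:
        # Case 1: a lies strictly left of b-a, push everything in between.
        return text[a_end:ba_start]
    if ba_end < a_start:
        # Case 3: a lies strictly right of b-a, nothing observable is pushed.
        return None
    # Case 2: the spans touch or overlap, push their intersection.
    return text[max(a_start, ba_start):min(a_end, ba_end)]


# Case 1 example from above: a = "defghijkl", b-a = "opqrstuvwxy".
between = pushed_onto_b(ALPHABET, (3, 12), (14, 25))
```

The spans in the usage line correspond to the case 1 diagram; the rows of the case 2 table can be checked the same way by plugging in the offsets of the two bracketed substrings.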