Regex to match and limit character classes

Rob*_*ond 5 java regex

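I'm not sure whether this is possible with regex, but I'd like to limit the number of underscores allowed based on the number of other characters. The goal is to rein in crazy wildcard queries sent to a search engine written in Java.

The starting characters would be alphanumeric, and I basically want a match if there are more underscores than preceding characters. So

BA_ would be fine, but BA___ would match the regex and get kicked out of the query parser.

Is that possible using regex?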

Cas*_*yte 8

Yes, you can do it. This pattern succeeds only if there are fewer underscores than letters (you can adapt it to the characters you want):

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$

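(As Pshemo noticed, the anchors aren't needed if you use the matches() method. I wrote them to illustrate the fact that this pattern must be bounded in some way, for example with lookarounds.)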
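As a rough illustration of how this first pattern might be plugged into a query parser, here is a minimal Java sketch. The class `WildcardGuard` and the method `isAllowed` are hypothetical names, not part of the original answer; only the pattern itself comes from it.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class WildcardGuard {
    // The pattern from the answer, written as a Java string literal.
    private static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$");

    // True when the term is uppercase letters followed by strictly fewer underscores.
    static boolean isAllowed(String term) {
        return FEWER_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("BA_"));   // true: 2 letters, 1 underscore
        System.out.println(isAllowed("BA___")); // false: 2 letters, 3 underscores
    }
}
```

Since `matches()` already requires the whole input to match, the `^` and `$` are redundant here and are kept only to mirror the pattern as written.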

Negated version:

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$
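The negated version could be used directly as a rejection test. The sketch below assumes a hypothetical class `WildcardRejector` with a method `shouldReject`; neither name appears in the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class WildcardRejector {
    // The negated pattern from the answer, written as a Java string literal.
    private static final Pattern TOO_MANY_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$");

    // True when the trailing underscores are not outnumbered by the letters.
    static boolean shouldReject(String term) {
        return TOO_MANY_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldReject("BA___")); // true
        System.out.println(shouldReject("BA_"));   // false
    }
}
```

Note that a term with as many underscores as letters, such as BA__, is also matched by this version, because every letter finds its underscore and nothing is left over.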

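The idea is to repeat a capture group that contains a backreference to itself plus an underscore. At each repetition, the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ will match all the letters that have a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish the pattern with \\1?, which contains all the underscores (I make it optional in case there are no underscores at all).

Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set exactly the difference between the number of letters and the number of underscores.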
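To make the [A-Z]{n} variant concrete, here is a small Java sketch. The class `LetterSurplus` and its methods are hypothetical names chosen for illustration; the pattern is the first one with `[A-Z]+` swapped for `[A-Z]{n}`.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class LetterSurplus {
    // Builds the first pattern with [A-Z]{n} in place of [A-Z]+.
    static Pattern exactSurplus(int n) {
        return Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]{" + n + "}\\1?$");
    }

    // True when the string has exactly n more letters than underscores.
    static boolean hasSurplus(String s, int n) {
        return exactSurplus(n).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSurplus("ABC_", 2));  // true: 3 letters, 1 underscore
        System.out.println(hasSurplus("ABC__", 2)); // false: the difference is 1
        System.out.println(hasSurplus("ABC__", 1)); // true
    }
}
```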


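To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's impossible to put an underscore in bold, I use hyphens instead):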

ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 1. In the non-capturing group, [A-Z] finds the first letter, A.
 2. We enter the lookahead (keep in mind that everything in the lookahead is only
    a check and not part of the match result). [A-Z]* runs over B and C.
 3. The first capturing group is encountered for the first time and its content is
    not defined. This is why an optional quantifier is used: it keeps the lookahead
    from failing. Consequence: \1?+ matches nothing.
 4. The literal hyphen matches the first hyphen. Once the capture group is closed,
    it is defined and contains one hyphen.
 5. The lookahead succeeds, so the non-capturing group is repeated.
 6. The second letter, B, is found.
 7. We enter the lookahead again.
 8. Now things are different. The capture group was defined before and contains
    a hyphen, which is why \1?+ matches the first hyphen.
 9. The literal hyphen matches the second hyphen in the string, and capture group 1
    now contains the two hyphens. The lookahead succeeds.
10. We repeat the non-capturing group one more time.
11. In the lookahead there are no more letters. That is not a problem, since the *
    quantifier is used.
12. \1?+ now matches two hyphens.
13. But there is no hyphen left in the string for the literal hyphen, and the regex
    engine cannot backtrack since \1?+ has a possessive quantifier. The lookahead
    fails, and so does the third repetition of the non-capturing group.
14. [A-Z]+ ensures that there is at least one more letter.
15. \1? matches the end of the string with the backreference to capture group 1,
    which contains the two hyphens. Because this backreference is optional, the
    string may have no hyphens at all.
16. $ matches the end of the string. The pattern succeeds.
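The growth of the capture group described in this walkthrough can be observed with a short Java sketch. It uses only the repeated part of the pattern, with a hyphen standing in for the underscore as in the walkthrough; the class `GroupGrowthDemo` and the method `inspect` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupGrowthDemo {
    // Only the repeated part of the pattern, with '-' standing in for '_'.
    private static final Pattern PAIRED =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+");

    // Returns { letters that found a partner hyphen, content of capture group 1 }.
    static String[] inspect(String s) {
        Matcher m = PAIRED.matcher(s);
        if (!m.find()) {
            return new String[] {null, null};
        }
        return new String[] {m.group(), m.group(1)};
    }

    public static void main(String[] args) {
        String[] result = inspect("ABC--");
        System.out.println(result[0]); // AB
        System.out.println(result[1]); // --
    }
}
```

The match itself covers only AB, while group 1 holds the two hyphens found by the lookaheads, which is what the final `[A-Z]+\1?$` then relies on.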


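NOTE: the possessive quantifier on the non-capturing group is needed to avoid false results. (This is where you can observe a strange behaviour that can be useful.)

Example: ABC--- and the pattern ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)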

ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$

 The non-capturing group is repeated three times and ABC is matched.
 Note that at this step the first capturing group contains ---
 But after the non-capturing group there is no letter left for [A-Z]+ to match,
 and the regex engine must backtrack.

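Question: how many hyphens are in the capture group now?
Answer:   still three!

When the repeated non-capturing group gives back a letter, the capture group still contains three hyphens (its content from the last time the regex engine set it). This is counter-intuitive, but logical.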

 After backtracking, the letter C is found by [A-Z]+,
 then \1? matches the three hyphens,
 and the pattern succeeds.
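The difference between the two quantifiers can be checked side by side with a small Java sketch. The class `PossessiveDemo` and its field names are hypothetical; both patterns are the ones discussed above, with hyphens instead of underscores.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PossessiveDemo {
    // Possessive repetition (*+): the group cannot give letters back.
    static final Pattern POSSESSIVE =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");

    // Greedy repetition (*): the group can backtrack, but group 1 keeps its last content.
    static final Pattern GREEDY =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?$");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(POSSESSIVE, "ABC---")); // false, as intended
        System.out.println(matches(GREEDY, "ABC---"));     // true, a false positive
        System.out.println(matches(POSSESSIVE, "ABC--"));  // true
    }
}
```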

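Robby Pond asked me in the comments how to find strings that have more underscores than letters (that is, than everything that isn't an underscore). Obviously, the best way is to count the number of underscores and compare it with the string length. As for a full-regex solution, it is not possible to build such a pattern with Java, since the pattern needs the recursion feature. For example, you can do it with PHP: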

$pattern = <<<'EOD'
~
 (?(DEFINE)
     (?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
 )

 \A (?: \g<neutral> | _ )+ \z
~x
EOD;

var_dump(preg_match($pattern, '____ABC_DEF___'));
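For Java, the counting approach mentioned above (count the underscores and compare with the string length) might look like the following sketch. The class `UnderscoreCounter` and the method `moreUnderscoresThanOthers` are hypothetical names, and the test string is the one used in the PHP example.

```java
// Hypothetical helper; the class and method names are illustrative only.
class UnderscoreCounter {
    // True when underscores outnumber all the other characters of the string.
    static boolean moreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        long others = s.length() - underscores;
        return underscores > others;
    }

    public static void main(String[] args) {
        // 8 underscores against 6 letters
        System.out.println(moreUnderscoresThanOthers("____ABC_DEF___")); // true
        // 1 underscore against 2 letters
        System.out.println(moreUnderscoresThanOthers("BA_"));            // false
    }
}
```

Unlike the regex versions, this check does not care where the underscores sit in the string.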