Regex look-behind without obvious maximum length in Java

Bar*_*ers 24 java regex

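I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious length. So the STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:

"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"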

- http://www.regular-expressions.info/lookaround.html
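The rule in the quote can be illustrated with a minimal sketch. The class name `BoundedLookBehind` and the helper `found` are hypothetical, introduced only for this example; the sketch uses nothing beyond `Pattern.compile` and `Matcher.find`, and it only covers the bounded cases the quote describes.

```java
import java.util.regex.Pattern;

public class BoundedLookBehind {

    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The optional 'a' gives the look-behind two possible lengths: "C" or "Ca".
        System.out.println(found("(?<=Ca?)B", "CaB"));        // true
        // Explicit upper bound: 'C' followed by 0 to 3 'a' characters.
        System.out.println(found("(?<=Ca{0,3})B", "CaaaB"));  // true
        // Four 'a' characters exceed the bound, so the look-behind fails.
        System.out.println(found("(?<=Ca{0,3})B", "CaaaaB")); // false
    }
}
```

In each case the look-behind can only match a finite set of lengths, which is the property the quote relies on.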

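Using curly braces works as long as the total length of the range of characters inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid: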

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these aren't:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"
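To check the five patterns above on a given JDK, something like the following sketch could be used. The class name `LookBehindCompileCheck` and the helper `compiles` are hypothetical. No expected output is shown because the result depends on the Java version; the valid/invalid split described above is what was observed on Java 1.6.0_14.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookBehindCompileCheck {

    // Hypothetical helper: true if Pattern.compile accepts the regex,
    // false if it rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] regexes = {
            "(?<=a{0,"   + Integer.MAX_VALUE       + "})B",
            "(?<=Ca{0,"  + (Integer.MAX_VALUE - 1) + "})B",
            "(?<=CCa{0," + (Integer.MAX_VALUE - 2) + "})B",
            "(?<=Ca{0,"  + Integer.MAX_VALUE       + "})B",
            "(?<=CCa{0," + (Integer.MAX_VALUE - 1) + "})B"
        };
        // Print one line per pattern; the outcome may differ between JDK releases.
        for (String regex : regexes) {
            System.out.println(regex + " -> " + (compiles(regex) ? "valid" : "invalid"));
        }
    }
}
```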

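However, I don't understand the following:

When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).

But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).

Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).

Here's the test harness: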

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

Output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

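My question may be obvious, but I'll ask it anyway: can anyone explain to me why Tests 1 and 2 succeed while Tests 3 and 4 fail? I would have expected all of them to fail, not half of them to work and half of them to fail.

Thanks.

PS. I'm using Java version 1.6.0_14.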

Jon*_*erg 17

Glancing at the source code of Pattern.java reveals that '*' and '+' are implemented as instances of Curly (the object created for the curly-brace operators). So

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behavior for curly braces and stars.
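One way to picture why a leading character changes the outcome is the sketch below. It is a hypothetical model, not the actual Pattern code: it assumes the maximum length is accumulated in a signed 32-bit int and treated as unknown when the sum wraps around. The answer above does not state this, but the assumption is consistent with the valid and invalid patterns listed in the question. The names `MaxLengthOverflow` and `addRepeated` are made up for the example.

```java
public class MaxLengthOverflow {

    // Hypothetical model (an assumption, not the real Pattern code):
    // add "atomLength * maxRepeats" to the length seen so far and
    // return -1 if the int sum wraps around.
    static int addRepeated(int lengthSoFar, int atomLength, int maxRepeats) {
        int total = atomLength * maxRepeats + lengthSoFar;
        return total < lengthSoFar ? -1 : total;
    }

    public static void main(String[] args) {
        // a* alone: 1 * 0x7FFFFFFF + 0 still fits in an int.
        System.out.println(addRepeated(0, 1, Integer.MAX_VALUE));     // 2147483647
        // Ca*: 1 * 0x7FFFFFFF + 1 wraps to a negative value.
        System.out.println(addRepeated(1, 1, Integer.MAX_VALUE));     // -1
        // CCa{0,MAX_VALUE-2}: 2 + (MAX_VALUE - 2) is exactly MAX_VALUE.
        System.out.println(addRepeated(2, 1, Integer.MAX_VALUE - 2)); // 2147483647
    }
}
```

Under this model, `a*` on its own looks like a repetition with a very large but finite bound, while any prefix pushes the total past Integer.MAX_VALUE.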


Ala*_*ore 13

It's a bug: http://bugs.sun.com/view_bug.do?bug_id=6695369

Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a look-behind match.