正则表达式字符类的双重否定中的错误?

Psh*_*emo 19 java regex

TL; DR
为什么[^\\D2],[^[^0-9]2],[^2[^0-9]]得到不同的结果在Java中?


用于测试的代码.你现在可以跳过它.

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));
Run Code Online (Sandbox Code Playgroud)

让我们说我需要正则表达式接受字符

  • 不是数字,
  • 除了2.

所以,这样的正则表达式应该代表每个字符除0,1,3,4,..., 9.我至少可以通过两种方式来编写它,这将是与2不一致的所有内容的总和:

  • [[^0-9]2]
  • [\\D2]

这两个正则表达式都按预期工作

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true
Run Code Online (Sandbox Code Playgroud)

现在让我说我想要反转接受的字符.(所以我想接受除2之外的所有数字)我可以创建显式包含所有接受的字符的正则表达式

  • [013-9]

或尝试在另一个包装它否定两个先前描述的正则表达式[^...]

  • [^\\D2]
  • [^[^0-9]2]
    甚至
  • [^2[^0-9]]

但令我惊讶的是,只有前两个版本按预期工作

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true 
Run Code Online (Sandbox Code Playgroud)

所以我的问题是为什么[^[^0-9]2][^2[^0-9]]不表现[^\D2]?我可以以某种方式纠正这些正则表达式,以便我可以[^0-9]在其中使用吗?

Kep*_*pil 16

按照JavaDoc的页面嵌套类产生的工会两班的,这使得它无法使用该符号创建一个交集:

要创建联合,只需将一个类嵌套在另一个类中,例如[0-4 [6-8]].此特定联合创建一个与数字0,1,2,3,4,6,7和8匹配的单个字符类.

要创建交叉点,您必须使用&&:

要创建仅匹配其所有嵌套类共有的字符的单个字符类,请使用&&,如[0-9 && [345]]中所示.此特定交集创建单个字符类,仅匹配两个字符类共有的数字:3,4和5.

问题的最后一部分对我来说仍然是一个谜.结合[^2]并且[^0-9]应该确实如此[^2],所以[^2[^0-9]]表现得如预期的那样.[^[^0-9]2]表现得像[^0-9]的确很奇怪.


nha*_*tdh 15

在Oracle的Pattern类实现的字符类解析代码中有一些奇怪的伏都教,如果你从Oracle的网站下载或者你正在使用OpenJDK,那么它与你的JRE/JDK一起提供.我还没有检查其他JVM(特别是GNU Classpath)实现如何解析问题中的正则表达式.

从这一点来看,对Pattern类及其内部工作的任何引用都严格限于Oracle的实现(参考实现).

阅读并理解Pattern类如何解析嵌套否定需要一些时间,如问题所示.但是,我编写了一个程序1来从一个Pattern对象中提取信息(使用Reflection API)来查看编译结果.以下输出来自在Java HotSpot Client VM版本1.7.0_51上运行我的程序.

1:目前,该计划令人尴尬.当我完成并重构它时,我将用链接更新这篇文章.

[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

这里没什么好惊讶的.

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)
[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

接下来的2个案例编译成同一个程序[^0-9],这是违反直觉的.

[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)
[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

如上所述,在上述2个案例中没有任何异议.

[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

如问题所述,这2个案例按预期工作.但是,请注意引擎如何补充第一个字符类(\D)并将set差异应用于由剩余部分组成的字符类.

[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

通过Keppil在评论中的测试证实,上面的输出显示上面的所有3个正则表达式都被编译到同一个程序中!

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

相反的NOT(UNION(2, NOT(0-9)),这是0-13-9,我们得到的UNION(NOT(2), NOT(0-9)),这相当于NOT(2).

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Run Code Online (Sandbox Code Playgroud)

由于相同的bug ,正则表达式[^2[^[^0-9]]]编译到同一个程序[^2[^0-9]].

有一个未解决的bug似乎具有相同的性质:JDK-6609854.


说明

初步

以下是Pattern在进一步阅读之前应该知道的类的实现细节:

  • Patternclass将a编译String成一个节点链,每个节点负责一个小而明确定义的职责,并将工作委托给链中的下一个节点.Nodeclass是所有节点的基类.
  • CharPropertyclass是所有与字符类相关Node的基类.
  • BitClassclass是类的子CharProperty类,它使用boolean[]数组来加速Latin-1字符的匹配(代码点<= 255).它有一个add方法,允许在编译期间添加字符.
  • CharProperty.complement,Pattern.union,Pattern.intersection是组对应的操作方法.他们所做的是不言自明的.
  • Pattern.setDifference不对称的集合差异.

乍一看解析角色类

在查看完整的CharProperty clazz(boolean consume)方法代码(负责解析字符类的方法)之前,让我们看一下极其简化的代码版本,以了解代码的流程:

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}
Run Code Online (Sandbox Code Playgroud)

代码基本上读取输入(输入String转换为以null结尾 int[]的代码点),直到它命中]或结束String(未闭合的字符类).

代码有点令人困惑,continuebreakswitch块内混合在一起.但是,只要您意识到它continue属于外部for循环并且break属于switch块,代码就很容易理解:

  • 结束的案例continue将永远不会在switch语句后执行代码.
  • 结束的情况break可以在switch语句之后执行代码(如果它还没有return).

通过上面的观察,我们可以看到,每当发现一个字符是非特殊的并且应该包含在字符类中时,我们将在switch语句之后执行代码,其中node = range(bits);是第一个语句.

如果检查源代码,该方法CharProperty range(BitClass bits)将解析"字符类中的单个字符或字符范围".该方法返回BitClass传入的相同对象(添加新字符)或返回CharProperty类的新实例.

血淋淋的细节

接下来,让我们看一下代码的完整版本(&&省略部分解析字符类交集):

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}
Run Code Online (Sandbox Code Playgroud)

纵观代码case '[':中的switch语句和之后的代码switch语句:

  • 所述node变量存储解析的结果单元(一个独立的字符,字符范围,一个速记字符类,一POSIX/Unicode字符类或嵌套字符类)
  • prev变量存储编译的结果,到目前为止,始终更新我们编译一个之后单元node.

由于boolean include记录字符类是否被否定的局部变量永远不会传递给任何方法调用,因此它只能在此方法中单独执行.并且includeswitch声明之后读取和处理的唯一地方.

邮政在建

  • @Pshemo:我写信给core-lib-dev,结果发现问题已经讨论过,并且已经提出了补丁[自2011年起](http://mail.openjdk.java.net/pipermail/core-libs-开发/ 2011六月/ 006957.html).我在[给我们写的电子邮件](http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-February/025314.html)中详细解释了当前的行为.我应该用这些信息更新这篇文章吗? (2认同)
  • @ user1803551:这里的程序https://github.com/nhahtdh/pattern-dissector(虽然在我当前的代码后面,但应该输出类似于上面的代码)。但是,我没有时间完成其余的分析。这将需要6到8个小时的重点工作(这是个半开玩笑,但有一半是真实的) (2认同)