What is a word boundary in regex?

Question

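I am using Java regexes in Java 1.6 (among other things, to parse numeric output) and cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I would be grateful to know of ways to match space-separated numbers.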

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

Answer 1

bri*_*ary 80

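A word boundary, in most regex dialects, is a position between a \w (word character, [0-9A-Za-z_]) and a \W (non-word character), or the beginning or end of the string if the string begins or ends (respectively) with a word character.

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character, so there is no boundary between the preceding space and the dash.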
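
For the question's goal of matching whitespace-separated integers, here is a minimal sketch of one possible approach. The class name `SignedIntegerTokens` and the pattern are illustrative assumptions, not something taken from this answer. Instead of \b, it uses lookarounds that only require that no non-whitespace character touches the token on either side, so a leading dash is accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignedIntegerTokens {
    // Hypothetical pattern: an optional minus sign followed by digits, with
    // no non-whitespace character directly before or after the token.
    static final Pattern INTEGER = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    // Collects every whitespace-delimited integer found in the input.
    static List<String> findIntegers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = INTEGER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "x-3" and "4y" are skipped because a non-whitespace character
        // touches the number.
        System.out.println(findIntegers(" 12  -12 x-3 4y 7")); // prints [12, -12, 7]
    }
}
```

The negative lookarounds also succeed at the start and end of the input, so no separate ^ or $ alternatives are needed.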

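Correctamundo. `\b` is a zero-width assertion that matches if there is a `\w` on one side and either a `\W` on the other side or the position is the beginning or end of the string. `\w` is arbitrarily defined as "identifier" characters (alnums and underscore), not as anything especially useful for English. (25 upvotes)
@brianary Slightly simpler: `(?<!\w)hello(?!\w)`. (5 upvotes)
Similar to `(^|\W)hello($|\W)`, except that it would not capture any non-word characters before or after, so it is more like `(^|(?<=\W))hello($|(?=\W))` (using lookahead/lookbehind assertions). (4 upvotes)
For the sake of understanding, is it possible to rewrite the regex `\bhello\b` without using `\b` (using `\w`, `\W` and others)? (3 upvotes)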
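
The three spellings discussed in these comments can be compared side by side. The sketch below is only an illustration (the class and method names are made up for it). It shows that the lookaround version returns the same text as `\bhello\b`, while `(^|\W)hello($|\W)` also consumes the neighbouring non-word characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HelloBoundaryVariants {
    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] regexes = {
            "\\bhello\\b",           // built-in word boundary
            "(?<!\\w)hello(?!\\w)",  // lookaround rewrite, zero-width
            "(^|\\W)hello($|\\W)"    // consuming rewrite
        };
        String[] inputs = { "say hello, world", "hello", "othello", "hello_there" };
        for (String regex : regexes) {
            for (String input : inputs) {
                System.out.println(regex + " on \"" + input + "\" -> "
                        + firstMatch(regex, input));
            }
        }
    }
}
```

On "say hello, world" the first two patterns match "hello", whereas the third matches " hello," including the space and the comma. None of the three matches inside "othello" or "hello_there", because the underscore counts as a word character.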

Answer 2

Wol*_*gon 23

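A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric (plus the underscore); a minus sign is not. Taken from the Regex Tutorial.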
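
The three positions listed above can be checked directly by asking Java where `\b` matches. This is a small sketch with a hypothetical helper name (`boundaries`); it relies on `Matcher.find()` stepping past zero-width matches, and it uses ASCII input only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryPositions {
    // Returns the indexes in the input at which the zero-width \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("12"));    // prints [0, 2]
        System.out.println(boundaries(" -12 ")); // prints [2, 4]
        System.out.println(boundaries("-"));     // prints []
    }
}
```

In " -12 " the only boundaries are just before the 1 and just after the 2, which is why a \b placed in front of the dash cannot match.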

Answer 3

Ala*_*ore 12

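A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Am I the only one who still feels like solving a puzzle after reading this answer? (4 upvotes)
I was going through a minimalist phase when I wrote this. (3 upvotes)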

Answer 4

tch*_*ist 7

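I talk about what \b-style regex boundaries actually are here.

The short story is that they are conditional. Their behavior depends on what they are next to.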

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that is not what you want. See my other answer for elaboration.
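
The conditional `(?(cond)yes|no)` syntax above is not supported by java.util.regex, so for Java readers here is a rough sketch of the same idea written as an alternation of lookarounds. The constant `EXPLICIT_BOUNDARY` and the helper names are assumptions made for this illustration, and the comparison with the built-in `\b` is only claimed for ASCII input, since Java's Unicode handling of `\b` has varied between releases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryAsLookarounds {
    // Either: a word character behind and none ahead,
    // or: no word character behind and one ahead.
    static final String EXPLICIT_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the indexes at which the (zero-width) regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    // True if the explicit spelling matches at exactly the same places as \b.
    static boolean sameAsBuiltIn(String input) {
        return positions("\\b", input).equals(positions(EXPLICIT_BOUNDARY, input));
    }

    public static void main(String[] args) {
        String sample = "-12 and C++, foo_bar!";
        System.out.println(positions(EXPLICIT_BOUNDARY, sample));
        System.out.println(sameAsBuiltIn(sample));
    }
}
```

Spelling the boundary out like this makes it easy to keep only the half that is wanted, for example `(?<!\w)` alone when the token may start with a non-word character.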

Answer 5

snr*_*snr 7

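While learning regular expressions, I was really stuck on the metacharacter \b. I kept asking myself "what is it, what is it" and still did not comprehend its meaning. After some attempts using the website, I noticed pink vertical lines at the beginning and end of every word. At that point its meaning became clear to me: it is exactly a word (\w) boundary.

My view is purely about building intuition. The logic behind it should be examined in the other answers.

A very good site for understanding what a word boundary is and how matches happen. (6 upvotes)
This post deserves credit for showing instead of telling. A picture is worth a thousand words. (6 upvotes)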

Answer 6

小智 6

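I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to give a language a name that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I am sure there was a good reason for it at the time.)

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not the dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore, and numeric symbols that are not digits, may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

That is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and the plus signs) are defeated by \b.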
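
To make the failure concrete: \b needs a word character on one side, and the names C++, C# and .NET begin or end with a non-word character. The sketch below is an illustration under that reading; the helper `standalone` is hypothetical and uses negative lookarounds rather than explicit punctuation lists.

```java
import java.util.regex.Pattern;

class LanguageNameSearch {
    // True if regex matches anywhere in text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Hypothetical helper: quotes the name literally and requires that no
    // word character touches it on either side.
    static String standalone(String name) {
        return "(?<!\\w)" + Pattern.quote(name) + "(?!\\w)";
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        // \b cannot match between '+' and ')', '#' and ',', or ' ' and '.'.
        System.out.println(contains("\\bC\\+\\+\\b", text));   // prints false
        System.out.println(contains("\\bC#\\b", text));        // prints false
        System.out.println(contains("\\b\\.NET\\b", text));    // prints false
        System.out.println(contains(standalone("C++"), text)); // prints true
        System.out.println(contains(standalone("C#"), text));  // prints true
        System.out.println(contains(standalone(".NET"), text)); // prints true
    }
}
```

One limitation of this sketch: `standalone("C")` still matches the C inside "C++" or "C&O", because + and & are not word characters. Rejecting those cases requires listing the characters allowed around the name.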

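Note: I am not sure what to do about mistakes in the text, such as someone not putting a space after the period at the end of a sentence. I allowed for it, but I am not sure that it is necessarily the right thing to do.

Anyway, in Java, if you are searching text for those weirdly named languages, you need to replace the \b with explicit designators for the whitespace and punctuation allowed before and after the word. For example: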

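import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every line of the input that contains a match for regexp,
// joined with newlines (empty string if no line matches).
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}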

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the too-short "cutesy" name, searches find things they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

P.S. My thanks to http://regexpal.com/, without which the regex world would be a very miserable place!