I am trying to split a string containing several sentences into an array of strings, one sentence per element.
This is what I have so far:
String input = "Hello World. "
        + "Today in the U.S.A., it is a nice day! "
        + "Hurrah! "  // trailing space needed, otherwise \\s+ cannot split before "Here"
        + "Here it comes... "
        + "Party time!";
String[] array = input.split("(?<=[.?!])\\s+(?=[\\D\\d])");
This code works perfectly well. I get:
Hello World.
Today in the U.S.A., it is a nice day!
Hurrah!
Here it comes...
Party time!
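I use a lookbehind to check whether a sentence-ending punctuation mark ('.', '?' or '!') comes immediately before one or more whitespace characters. If it does, the string is split at that whitespace.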
However, this regex does not cover some exceptions. For example,
"The U.S. is a great country" is wrongly split into "The U.S." and "is a great country".
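The failure can be reproduced in isolation. The following is only a minimal sketch: the class name RegexSplitDemo and the helper split are hypothetical names chosen for illustration, while the pattern is the one from the code above.

```java
import java.util.Arrays;
import java.util.List;

public class RegexSplitDemo {
    // Same pattern as in the question: split on whitespace that directly
    // follows '.', '?' or '!' and is followed by any character.
    static final String SENTENCE_BOUNDARY = "(?<=[.?!])\\s+(?=[\\D\\d])";

    // Hypothetical helper: applies the pattern and returns the pieces.
    static List<String> split(String text) {
        return Arrays.asList(text.split(SENTENCE_BOUNDARY));
    }

    public static void main(String[] args) {
        // The period ending the abbreviation "U.S." is followed by a space,
        // so the pattern treats it as a sentence boundary.
        for (String piece : split("The U.S. is a great country.")) {
            System.out.println("[" + piece + "]");
        }
        // prints:
        // [The U.S.]
        // [is a great country.]
    }
}
```

The lookbehind only sees the single character before the whitespace, so it has no way to tell an abbreviation's period from a sentence-final one.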
Any ideas on how to solve this?
Also, am I missing any other edge cases here?
Mar*_*ach 10
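If you don't have to use a regex, you can use Java's built-in BreakIterator.
The following code shows an example of parsing sentences, but BreakIterator also supports other forms of parsing (word, line, etc.). You can optionally pass in a different locale if you are working with different languages. This example uses the default locale.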
import java.text.BreakIterator;

String input = "Hello World. "
        + "Today in the U.S.A., it is a nice day! "
        + "Hurrah!"
        + "The U.S. is a great country. "
        + "Here it comes... "
        + "Party time!";
// Sentence iterator for the default locale
BreakIterator iterator = BreakIterator.getSentenceInstance();
iterator.setText(input);
// Walk consecutive boundaries; each [start, end) span is one sentence
int start = iterator.first();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
    System.out.println(input.substring(start, end));
}
This produces the following output:
Hello World.
Today in the U.S.A., it is a nice day!
Hurrah!
The U.S. is a great country.
Here it comes...
Party time!
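The locale option mentioned above can be wrapped in a small helper that returns the sentences as a list instead of printing them. This is only a sketch under stated assumptions: the class name SentenceSplitter and the method splitSentences are hypothetical, and trimming each piece and skipping empty ones is a choice made here for convenience, not something BreakIterator does itself. The BreakIterator calls are the same ones used in the example, plus the getSentenceInstance(Locale) overload.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Hypothetical helper: splits text into sentences using the
    // sentence-boundary rules of the given locale.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // Each span keeps its trailing whitespace, so trim it here
            // (an assumption of this sketch, not BreakIterator behavior).
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "Hello World. The U.S. is a great country. Party time!";
        for (String sentence : splitSentences(input, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

Passing the locale explicitly keeps the result independent of the machine's default settings, which may matter when the same text is processed on different systems.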