Is there a way in R to separate sentences that are missing a space, i.e. "Sentence one.Sentence two"?

Question

Pie*_*ter 3 regex r text-parsing stringr

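I scraped some chunks of text from an XML file, and these chunks are often missing the space between sentences. I have successfully used str_split to break the chunks into digestible sentences, as follows:

list_of_strings <- str_split(chunk_of_text, pattern=boundary("sentence"))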

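This works well, but it does not handle the case where a terminating period is not followed by a space. For example, "This sentence ends.This sentence continues." is returned as one sentence rather than two.

Using str_split with pattern=boundary("sentence") does not work in that case.

If I search for each period and replace it with a period plus a space, that will of course mess up numbers such as 1.5 pounds.

I have explored using a regular expression to detect the case, for example,

str_view_all(x, "[[:alpha:]]\\.[[:alpha:]]")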

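but I don't know how to 1) insert a space after the period so that subsequent calls to str_split work correctly, or 2) split at the period.

Any suggestions on how to separate sentences when this happens?

I'm a novice R programmer, thanks for your help!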

Answer 1

zep*_*ryl 5

library(stringr)

x <- "This sentence ends.This sentence continues. It costs 1.5 pounds.They needed it A.S.A.P.Here's one more sentence."

str_split(x, "\\.\\s?(?=[A-Z][^\\.])")

[[1]]
[1] "This sentence ends"        "This sentence continues"  
[3] "It costs 1.5 pounds"       "They needed it A.S.A.P"   
[5] "Here's one more sentence."

Explanation:

"\\.                 # literal period
    \\s?             # optional whitespace
        (?=[A-Z]     # followed by a capital letter
            [^\\.])" # which isn't followed by another period

Also note that this does not account for every possibility. For example, it will incorrectly split after "Dr." in "Dr. Perez is on call." You can handle that case by adding a negative lookbehind:

"(?<!Dr|Mr|Mrs|Ms|Mx)\\.\\s?(?=[A-Z][^\\.])"

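As a rough check of the lookbehind version, the sketch below runs it on a made-up string. The variable names (y, title_aware, parts) and the sample text are assumptions for illustration only, not part of the original question.

```r
library(stringr)

# Hypothetical sample: two honorifics plus one missing space after "call."
y <- "Dr. Perez is on call.Ms. Lee covers nights. They swap on Friday."

# Same split pattern as above, guarded by a negative lookbehind so that
# a period directly after Dr, Mr, Mrs, Ms or Mx is never a split point.
title_aware <- "(?<!Dr|Mr|Mrs|Ms|Mx)\\.\\s?(?=[A-Z][^\\.])"

parts <- str_split(y, title_aware)[[1]]
print(parts)
```

The periods after "Dr" and "Ms" are skipped, while "call.Ms" and "nights. They" are still treated as sentence boundaries. As with the original pattern, the separating period is consumed by the split, so only the final sentence keeps its period.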

But the specifics, and which other edge cases need handling, will depend on your data.
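The question also asked about option 1, inserting the missing space so that boundary("sentence") can be used afterwards. A minimal sketch of that route, assuming a missing space always sits between a lowercase letter and a capital letter (an assumption that will not hold for every data set), could look like this. The names x2, fixed and sentences are illustrative.

```r
library(stringr)

x2 <- "This sentence ends.This sentence continues. It costs 1.5 pounds."

# Replace a period that sits between a lowercase letter and a capital
# letter with a period plus a space. "1.5" is untouched because digits
# match neither lookaround.
fixed <- str_replace_all(x2, "(?<=[a-z])\\.(?=[A-Z])", ". ")
print(fixed)

# With the space restored, split on sentence boundaries as in the
# question, then trim the trailing whitespace each piece may carry.
sentences <- str_trim(str_split(fixed, boundary("sentence"))[[1]])
print(sentences)
```

Unlike the split pattern above, this approach keeps the terminating period on every sentence. Abbreviations such as "A.S.A.P.Here" are not matched by the lowercase lookbehind, so they would need a separate rule.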

Archived:	2 years, 10 months ago
Views:	83
Last recorded:	2 years, 10 months ago