Split text by sentences, but not at special patterns

Question

Here is my sample text:

text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."

I have a function that splits text into sentences:

library(stringi)
split_by_sentence <- function (text) {

  # split based on periods, exclams or question marks
  result <- unlist(strsplit(text, "\\.\\s|\\?|!") )

  result <- stri_trim_both(result)
  result <- result [nchar (result) > 0]

  if (length (result) == 0)
    result <- ""

  return (result)
}

It actually splits at every punctuation mark, including the one inside the abbreviation. This is the output:

> split_by_sentence(text)
[1] "First sentence"            "This is a second sentence" "I like pets e.g"           "cats or birds."

Is it possible to exclude special patterns such as "e.g."?

Answer 1

Cat*_*ath 4

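In your pattern, you can specify that you want to split at any punctuation mark followed by whitespace, provided it is immediately preceded by at least 3 alphanumeric characters (using a look-behind). This results in: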

unlist(strsplit(text, "(?<=[[:alnum:]]{3})[?!.]\\s", perl=TRUE))
#[1] "First sentence"                  "This is a second sentence"       "I like pets e.g. cats or birds."
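
To see why the look-behind leaves abbreviations alone, it can help to test the pattern on short fragments. The following is only an illustrative sketch: the fragments are made up for this check, and `grepl` is used merely to report whether a split point would be found.

```r
# Same pattern as above: punctuation + whitespace, but only when the
# three characters right before the punctuation are all alphanumeric.
pattern <- "(?<=[[:alnum:]]{3})[?!.]\\s"

# "nce" precedes the period, so this is a split point.
grepl(pattern, "sentence. This", perl = TRUE)

# The three characters before the last period are "e.g", which contains
# a non-alphanumeric ".", so no split point is found.
grepl(pattern, "e.g. cats", perl = TRUE)

# Only two alphanumeric characters ("bn") precede the period: no split.
grepl(pattern, "1.8 bn. horses", perl = TRUE)
```

By the same logic, a real sentence that ends in a word shorter than three characters would not be split either, so the threshold is a heuristic rather than a full abbreviation detector.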

If you want to keep the punctuation, you can move it into the look-behind and split on the whitespace only:

unlist(strsplit(text, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
# [1] "First sentence."                 "This is a second sentence."      "I like pets e.g. cats or birds."

text2 <- "I like pets (cats and birds) and horses. I have 1.8 bn. horses."

unlist(strsplit(text2, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
#[1] "I like pets (cats and birds) and horses." "I have 1.8 bn. horses."

Note: if the punctuation may be followed by more than one space, you can use \\s+ instead of \\s in the pattern.
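
Putting the pieces together, the function from the question could be adapted roughly as follows. This is a sketch under assumptions, not part of the original answer: the `min_chars` parameter is a hypothetical addition for illustration, and base R's `trimws` stands in for `stri_trim_both` so that no extra package is needed.

```r
# Sketch: split on whitespace that follows ., ? or !, but only when the
# punctuation is preceded by `min_chars` alphanumeric characters.
# `min_chars` is a hypothetical parameter, not from the original answer.
split_by_sentence <- function(text, min_chars = 3) {
  # The look-behind keeps the punctuation; \\s+ absorbs repeated spaces.
  pattern <- sprintf("(?<=[[:alnum:]]{%d}[?!.])\\s+", min_chars)
  result <- unlist(strsplit(text, pattern, perl = TRUE))

  result <- trimws(result)
  result <- result[nchar(result) > 0]

  # Mirror the question's behaviour: return "" when nothing is left.
  if (length(result) == 0)
    result <- ""

  result
}

text <- "First sentence.  This is a second sentence. I like pets e.g. cats or birds."
split_by_sentence(text)
```

Called on the sample text (here with a double space after the first sentence), this should yield three elements, with "e.g." left inside the last one.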

Archived:	8 years, 5 months ago
Views:	977
Last activity:	4 years, 6 months ago