Tokenizing sentences with unnest_tokens(), ignoring abbreviations

bsc*_*idr 4 text r tidytext

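I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into the two sentences

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use the default sentence tokenizer of tidytext, I get three sentences.

Code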

library(dplyr)
library(tidytext)

# tibble() replaces the deprecated data_frame()
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

Is there an easy way to use tidytext to tokenize sentences without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?

avi*_*seR 6

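You can use a regex as the splitting condition, but there is no guarantee that it covers all common honorifics. The pattern below only skips a period that follows a two-letter word ending in "r" (Mr, Dr, Sr, Jr), so "Mrs." and "Ms." would still be split: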

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
                                                     Sentence
                                                        <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2                         he owns it himself without disguise

Of course, you can always create your own list of common titles and build a regex from that list:

titles =  c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)

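  • Unfortunately, this solution breaks for sentences that end in "quotation marks." (In American typography, the closing punctuation goes inside the quotation marks.) This may or may not matter if you are removing punctuation anyway. (2 upvotes)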
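
One way to handle the closing-quote case raised in the comment above is to split on the whitespace that follows the sentence-ending punctuation rather than on the period itself. What follows is a minimal base R sketch, not part of the original answer: `split_sentences`, `not_after_title` and `boundary` are hypothetical names, and the pattern only knows the six titles listed above and straight (ASCII) quotes. It uses one fixed-length lookbehind per title so that it also works with PCRE engines that require fixed-length lookbehinds.

```r
# Titles from the answer above; extend as needed.
titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")

# One negative lookbehind per title, e.g. (?<!\bMr\.)(?<!\bDr\.)...
not_after_title <- paste0("(?<!\\b", titles, "\\.)", collapse = "")

# Split on whitespace that follows ., ! or ?, optionally followed by a
# closing straight quote, unless that punctuation belongs to a title.
boundary <- paste0(not_after_title, "(?<=[.!?]|[.!?][\"'])\\s+")

# Hypothetical helper: returns a character vector of sentences.
split_sentences <- function(text) {
  strsplit(text, boundary, perl = TRUE)[[1]]
}

split_sentences("She said, \"Mr. Darcy has no defect.\" He owns it himself.")
# [1] "She said, \"Mr. Darcy has no defect.\"" "He owns it himself."
```

Because only the whitespace is consumed, the punctuation and the closing quote stay attached to the sentence they end. Typographic (curly) quotes and other abbreviations are not handled by this sketch.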

Pat*_*rry 5

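Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus: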

library(dplyr)
library(corpus)
# tibble() replaces the deprecated data_frame()
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

text_split(df$Example_Text, "sentences")
##   parent index text                                                         
## 1 1          1 I am perfectly convinced by it that Mr. Darcy has no defect. 
## 2 1          2 He owns it himself without disguise.

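If you want to stick with unnest_tokens but want a more exhaustive list of English abbreviations, you can follow @useR's advice but use the corpus abbreviation list (most of which was taken from the Common Locale Data Repository):

abbreviations_en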
##  [1] "A."       "A.D."     "a.m."     "A.M."     "A.S."     "AA."       
##  [7] "AB."      "Abs."     "AD."      "Adj."     "Adv."     "Alt."    
## [13] "Approx."  "Apr."     "Aug."     "B."       "B.V."     "C."      
## [19] "C.F."     "C.O.D."   "Capt."    "Card."    "cf."      "Col."    
## [25] "Comm."    "Conn."    "Cont."    "D."       "D.A."     "D.C."    
## (etc., 155 total)
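
To make the suggestion above more concrete, here is a rough base R sketch that turns a vector of abbreviations into a splitting pattern. It is an illustration under assumptions rather than the answerer's code: `abbreviations` is a short stand-in vector (with corpus installed, the full abbreviation vector could presumably be substituted), `abbreviation_guard` is a hypothetical helper, and the escaping assumes every entry contains only letters and periods.

```r
# Stand-in for a longer abbreviation vector; every entry ends with a period.
abbreviations <- c("Mr.", "Mrs.", "Dr.", "Capt.", "a.m.", "A.D.")

# Hypothetical helper: one fixed-length negative lookbehind per abbreviation,
# with the literal periods escaped, e.g. (?<!\ba\.m\.)
abbreviation_guard <- function(abbr) {
  escaped <- gsub(".", "\\\\.", abbr, fixed = TRUE)
  paste0("(?<!\\b", escaped, ")", collapse = "")
}

# Split on whitespace that follows ., ! or ? unless an abbreviation precedes it.
pattern <- paste0(abbreviation_guard(abbreviations), "(?<=[.!?])\\s+")

text <- "Capt. Wentworth arrived at 9 a.m. with Dr. Grant. They left before noon."
sentences <- strsplit(text, pattern, perl = TRUE)[[1]]
sentences
# [1] "Capt. Wentworth arrived at 9 a.m. with Dr. Grant."
# [2] "They left before noon."
```

One trade-off of any abbreviation list: a sentence that genuinely ends with a listed abbreviation (for example "... at 9 a.m.") will not be split from the sentence that follows it.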