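I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
and tokenize it into two sentences.
However, when I use tidytext's default sentence tokenizer, I get three sentences.
Code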
library(dplyr)     # re-exports tibble(); data_frame() is deprecated
library(tidytext)
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
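Is there a simple way to tokenize sentences with tidytext without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics: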
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
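To see why that pattern carries no guarantee, it can help to test it against a few sentence-final tokens. The sketch below is only an illustration: it uses base R's `grepl()` with `perl = TRUE` (PCRE) rather than the ICU engine that tidytext relies on, on the assumption that both engines treat this particular pattern the same way. The token list is made up for the example.

```r
# The lookbehind only protects periods that follow a two-letter word ending in "r".
pattern <- "(?<!\\b\\p{L}r)\\."

# Hypothetical tokens, chosen to show both the hits and the misses.
tokens <- c("Mr.", "Dr.", "Mrs.", "Ms.", "defect.", "or.")

# TRUE means the period would be used as a split point.
splits <- setNames(grepl(pattern, tokens, perl = TRUE), tokens)
print(splits)
```

"Mr." and "Dr." are protected, but "Mrs." and "Ms." would still be split, and a sentence that genuinely ends in "or." would not be split at all.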
Of course, you can always create your own list of common titles and build a regex from that list:
titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
corpus and quanteda both have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus:
library(dplyr)
library(corpus)
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
text_split(df$Example_Text, "sentences")
## parent index text
## 1 1 1 I am perfectly convinced by it that Mr. Darcy has no defect.
## 2 1 2 He owns it himself without disguise.
If you want to stick with unnest_tokens but want a more exhaustive list of English abbreviations, you can follow @useR's advice but use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):
abbreviations_en
## [1] "A." "A.D." "a.m." "A.M." "A.S." "AA."
## [7] "AB." "Abs." "AD." "Adj." "Adv." "Alt."
## [13] "Approx." "Apr." "Aug." "B." "B.V." "C."
## [19] "C.F." "C.O.D." "Capt." "Card." "cf." "Col."
## [25] "Comm." "Conn." "Cont." "D." "D.A." "D.C."
## (etc., 155 total)
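The entries in that list end with a period, so they need a little massaging before they can go into a lookbehind like the one @useR built. Below is a rough sketch of that step, not a tested recipe for the full list. To stay dependency-free it uses a short stand-in vector (`abbrevs`, copied from a few of the entries shown above) instead of `corpus::abbreviations_en`, and base R's `strsplit()` with PCRE instead of `unnest_tokens`. The example sentence is invented.

```r
# Stand-in for corpus::abbreviations_en (assumption: same "Abbr." format).
abbrevs <- c("Approx.", "Apr.", "Aug.", "Capt.", "Col.")

# Drop the trailing period; the split pattern matches that period itself.
stems <- sub("\\.$", "", abbrevs)

# Quote each stem literally with \Q...\E and keep the alternatives at the
# top level of the lookbehind, which PCRE needs when their lengths differ.
alternatives <- paste0("\\b\\Q", stems, "\\E", collapse = "|")
pattern <- paste0("(?<!", alternatives, ")\\.\\s*")

text <- "Capt. Wentworth arrived in Aug. with Col. Brandon. They stayed a week."
sentences <- strsplit(text, pattern, perl = TRUE)[[1]]
print(sentences)
```

The same pattern string could presumably be passed as `pattern` to `unnest_tokens(..., token = "regex")`, but that is untested here. Abbreviations with inner periods (such as "A.D.") are only protected at their inner period if the shorter form ("A.") is also in the list.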