Parsing text by uppercase in R

efl*_*s89 3 text r text-mining uppercase

I have many large text files with the following basic structure:

text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

As you can see, it consists of: 1) some random text, 2) a person's name in uppercase, 3) that person's speech.

I've managed to split the text into a vector of individual words using:

textw<-unlist(strsplit(text," "))

Then I find the positions of all uppercase words:

grep(pattern = "^[[:upper:]]*$",x = textw)

And I've extracted the people's names into a vector:

upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]

The desired result would be a data frame or table like this:

Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
         message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))

Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us

I'm having trouble "linking" each message to its author.

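Also note: some uppercase words are not authors, for example "I". How can I specify that a split should happen only where two or more uppercase words are adjacent?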

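In other words, if positions 2 and 3 are uppercase words, then the message is everything from position 4 up to the next occurrence of two consecutive uppercase words.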

Any help is appreciated.
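For illustration only, the positional rule described in the question could be sketched in base R with `rle()`, which finds runs of adjacent uppercase tokens. This is not the asker's code: the optional trailing colon in the token pattern and the helper `join_tokens` are assumptions introduced here.

```r
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
textw <- unlist(strsplit(text, " "))

# Assumption: a token counts as uppercase if it is all capitals,
# optionally followed by a colon (e.g. "PERSON:").
is_upper <- grepl("^[[:upper:]]+:?$", textw)

# Locate runs of adjacent tokens; keep only uppercase runs of length >= 2,
# so a lone uppercase word such as "I" is not treated as a speaker.
runs <- rle(is_upper)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
is_label <- runs$values & runs$lengths >= 2
label_start <- run_start[is_label]
label_end <- run_end[is_label]

# Hypothetical helper: paste tokens from..to, or return "" for an empty range.
join_tokens <- function(from, to) {
  if (from > to) "" else paste(textw[from:to], collapse = " ")
}

# Speaker names are the label runs without the trailing colon;
# the text before the first label has no speaker (NA).
person <- c(NA, sub(":$", "", mapply(join_tokens, label_start, label_end)))

# Each message runs from the token after one label to the token before the next.
msg <- mapply(join_tokens,
              c(1, label_end + 1),
              c(label_start - 1, length(textw)))

result <- data.frame(person = person, message = msg, stringsAsFactors = FALSE)
print(result)
```

Working on tokens keeps the logic close to the "position 2 and 3" wording, at the cost of collapsing any repeated spaces when the tokens are pasted back together.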

Tyl*_*ker 8

Here's one approach using the stringi package:

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))

data.frame(
    person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
    message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)


##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us

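  • `[A-Z]{2,1000}` could possibly be replaced with `[A-Z][A-Z]+`, which allows the person identifier to be anywhere from two characters to unlimited length. (2 upvotes)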
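If stringi is not available, a similar result can probably be obtained in base R by matching the speaker labels directly and taking the text between them with `regmatches(..., invert = TRUE)`. This is a minimal sketch, not the answerer's method; the pattern in `label_pattern` assumes that a label is two or more uppercase words (each at least two letters) separated by single spaces and followed by a colon.

```r
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

# Assumed label format: 2+ uppercase words of 2+ letters each, then a colon.
label_pattern <- "[A-Z]{2,}(?: [A-Z]{2,})+:"

# Find every speaker label in the text.
m <- gregexpr(label_pattern, text, perl = TRUE)
labels <- regmatches(text, m)[[1]]

# invert = TRUE returns the pieces before, between, and after the matches,
# so there is always one more piece than there are labels.
pieces <- regmatches(text, m, invert = TRUE)[[1]]

result <- data.frame(
  person = c(NA, sub(":$", "", labels)),  # text before the first label has no speaker
  message = trimws(pieces),               # drop the spaces left around each piece
  stringsAsFactors = FALSE
)
print(result)
```

Because the pattern requires at least two uppercase words, a single capitalized word such as "I" inside a message does not start a new row. If the text begins with a label, the first row has an `NA` person and an empty message.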