efl*_*s89 3 text r text-mining uppercase
I have many large text files whose basic structure looks like this:
text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
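As you can see, it consists of: 1) random text, 2) a person's name in uppercase, 3) that person's speech.
I have managed to split the text into individual words using: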
textw<-unlist(strsplit(text," "))
Then I find the positions of all uppercase words:
grep(pattern = "^[[:upper:]]*$",x = textw)
and I extract the persons' names into a vector:
upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]
The desired result is a data frame or table like this:
Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
                   message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))
Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us
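What I cannot work out is how to link each message to its author.
Also note: some uppercase words are not authors, for example "I". How can I specify that a split should happen only where 2 or more uppercase words are adjacent?
In other words, if positions 2 and 3 are uppercase, the message is everything from position 4 up to the next occurrence of two consecutive uppercase words.
Any help is appreciated.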
Here is one approach using the stringi package:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))
data.frame(
  person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
  message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
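For comparison, the same idea can probably be done without extra packages. The following is only a rough sketch in base R, and it assumes that every speaker label is two or more uppercase words followed by a colon, as in the sample text. The names `label_rx`, `labels`, `chunks` and `result` are made up for illustration and are not from the posts above.

```r
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

# Assumption: a speaker label is two or more space-separated uppercase
# words followed by a colon. A lone uppercase word such as "I" has no
# second uppercase word next to it, so it is not treated as a label.
label_rx <- "[[:upper:]]+( [[:upper:]]+)+:"

m <- gregexpr(label_rx, text)

# The matched labels, e.g. "FIRST PERSON:" and "SECOND PERSON:"
labels <- regmatches(text, m)[[1]]

# Everything between the labels; for n labels this yields n + 1 pieces,
# the first being whatever precedes the first label.
chunks <- regmatches(text, m, invert = TRUE)[[1]]

result <- data.frame(
  person  = c(NA, sub(":$", "", labels)),  # leading text has no speaker
  message = trimws(chunks),                # drop surrounding whitespace
  stringsAsFactors = FALSE
)
result
```

If a file starts directly with a speaker label, the first row will have an `NA` person and an empty message, which can be filtered out afterwards. Note that `[[:upper:]]` depends on the locale, whereas `[A-Z]` in the stringi version above is restricted to ASCII letters.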