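I am using R to extract sentences that contain specific person names from text. Here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon and Johann Eck. With the help of the openNLP package, three of them (Martin Luther, Paul and Melanchthon) are correctly extracted and recognized. I then have two questions: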
Using `strsplit` and `grep`. First I made an object `para` that holds your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or, a little cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
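
One caveat with the alternation above is that `grep` matches substrings, so a name such as `Paul` would also hit a sentence that only mentions `Pauline`. A minimal sketch of one way around this, assuming the names contain no regex metacharacters, wraps each name in `\\b` word boundaries. The demo text and the names `para_demo`, `sentences_demo` and `with_bounds` are illustrative and not part of the original answer.

```r
# Hypothetical demo text; this is not the paragraph from the question.
para_demo <- "Paul wrote letters. The Pauline epistles are studied. Luther read them."

# Split on periods and drop the leading space left on each sentence.
sentences_demo <- trimws(unlist(strsplit(para_demo, split = "\\.")))

toMatch <- c("Paul", "Luther")

# Wrap each name in word boundaries so "Paul" does not match "Pauline".
with_bounds <- paste0("\\b", toMatch, "\\b", collapse = "|")

hits <- sentences_demo[grep(with_bounds, sentences_demo, perl = TRUE)]
print(hits)
```

Splitting on `"\\."` is itself a simplification: it treats every period as a sentence end, so text with abbreviations or decimal numbers would need a proper sentence detector.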
If you want the sentences for each person returned separately, then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each person's name to the result, do the following:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
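
As an alternative to prepending the name as the first element, the result can be returned as a named list, which keeps the names out of the sentence vectors. This is only a sketch: `sentences_demo` and `by_name` are illustrative names, and `fixed = TRUE` is an assumption that the names should be matched literally rather than as regular expressions.

```r
# Hypothetical demo sentences standing in for the split paragraph.
sentences_demo <- c("Luther met Melanchthon",
                    "Paul wrote letters",
                    "Melanchthon taught Greek")
toMatch <- c("Luther", "Paul", "Melanchthon")

# sapply() over a character vector uses its values as element names;
# simplify = FALSE keeps the result as a list, one entry per name.
by_name <- sapply(toMatch,
                  function(m) sentences_demo[grep(m, sentences_demo, fixed = TRUE)],
                  simplify = FALSE, USE.NAMES = TRUE)

print(by_name[["Melanchthon"]])
```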
If you want to find sentences that mention more than one person/place/thing (word), just add a pattern that requires both of them, for example:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and set `perl` to `TRUE`, since lookaheads need Perl-compatible regular expressions:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
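
The lookahead pattern above can also be assembled from a vector of terms instead of being typed by hand. The sketch below assumes the terms contain no regex metacharacters; `all_of` is a hypothetical helper name and `sentences_demo` is made-up demo data. It also shows a regex-free cross-check that combines one `grepl` result per term.

```r
# Hypothetical helper: build "(?=.*a)(?=.*b)" from a character vector of terms.
all_of <- function(terms) paste0("(?=.*", terms, ")", collapse = "")

sentences_demo <- c(
  "Melanchthon became professor of Greek",
  "Melanchthon replied based on the authority of Scripture",
  "He studied the Scripture"
)

terms <- c("Melanchthon", "Scripture")
pattern <- all_of(terms)

# Lookaheads require perl = TRUE.
both <- sentences_demo[grep(pattern, sentences_demo, perl = TRUE)]

# Cross-check without lookaheads: AND together one literal match per term.
keep <- Reduce(`&`, lapply(terms, grepl, x = sentences_demo, fixed = TRUE))

print(both)
```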
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
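
To tie this back to the sentence search, the bracketed names can serve as the vector of names to look for. The following is a sketch under the assumption that every name of interest is marked with `[[...]]`; `para_marked`, `names_found` and `by_name` are illustrative names, and the demo text is made up.

```r
# Hypothetical marked-up text; not the paragraph from the question.
para_marked <- "He accepted a call by [[Martin Luther]]. He studied [[Paul]]. He was present at Leipzig."

# Pull out every [[...]] span, then strip the brackets.
marked <- regmatches(para_marked, gregexpr("\\[\\[.*?\\]\\]", para_marked))[[1]]
names_found <- gsub("\\[\\[|\\]\\]", "", marked)

# Remove the markup from the text before splitting it into sentences.
plain <- gsub("\\[\\[|\\]\\]", "", para_marked)
sentences_demo <- trimws(unlist(strsplit(plain, split = "\\.")))

# fixed = TRUE matches each name literally, so regex metacharacters are harmless.
by_name <- lapply(names_found,
                  function(n) sentences_demo[grep(n, sentences_demo, fixed = TRUE)])
names(by_name) <- names_found

print(by_name)
```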