How to get POS tags in R using OpenNLP?

use*_*599 5 nlp r text-mining pos-tagger opennlp

Here is the R code:

library(NLP)
library(openNLP)

# Tag each word of x with its part of speech
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole input as a single sentence
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  # Keep only the word annotations and pull out their POS tags
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

str <- "this is a the first sentence."
tagged_str <- tagPOS(str)

The output is:

tagged_str$POStagged
[1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

Now I want to extract only the NN word from the sentence above, i.e. "sentence", and store it in a variable. Can anyone help me with this?

Ken*_*oit 6

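Here is a more general solution, in which you describe the Treebank tags you want to extract with a regular expression. In your case, for example, "NN" returns all noun types (e.g. NN, NNS, NNP, NNPS), while "NN$" returns only NN.

It operates on a character type, so if you have your texts as a list, you will need to lapply() it as in the example below.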

library(NLP)
library(openNLP)

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

extractPOS <- function(x, thisPOSregex) {
    x <- as.String(x)
    # Sentence and word tokenization, then POS tagging on top of it
    wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
    POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
    POSwords <- subset(POSAnnotation, type == "word")
    tags <- sapply(POSwords$features, '[[', "POS")
    # Indices of the words whose tag matches the regular expression
    thisPOSindex <- grep(thisPOSregex, tags)
    tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
    untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
    untokenizedAndTagged
}

lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
## 
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
## 
## [[2]]
## [1] ""
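The difference between the two patterns can be checked without running the tagger at all. The following is a minimal sketch in base R; the `tags` vector is a made-up stand-in for the tags of one sentence, and the fully anchored `"^NN$"` variant is an addition that does not appear in the answer above.

```r
# Hypothetical tag vector, standing in for the POS tags of one sentence
tags <- c("DT", "NN", "NNS", "NNP", "NNPS", "VBZ", "JJ")

# Unanchored pattern: matches every tag that contains "NN"
all_nouns <- grep("NN", tags, value = TRUE)

# End-anchored pattern: matches only tags that end in "NN"
nn_only <- grep("NN$", tags, value = TRUE)

# Fully anchored pattern (an assumption-free but stricter variant):
# matches only the tag that is exactly "NN"
nn_exact <- grep("^NN$", tags, value = TRUE)

print(all_nouns)
print(nn_only)
print(nn_exact)
```

With these tags, `"NN$"` and `"^NN$"` select the same single tag; the stricter form only matters if a tag set contains other tags ending in "NN".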


RHe*_*tel 2

There may be more elegant ways to get the result, but this should work:

q <- strsplit(unlist(tagged_str[1]),'/NN')
q <- tail(strsplit(unlist(q[1])," ")[[1]],1)
#> q
#[1] "sentence"
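If the sentence might contain more than one NN word, a variant that splits the tagged string into word/tag pairs may be easier to extend. This is only a sketch in base R: the tagged string from the question is hard-coded so it runs without openNLP, and the variable names (`tagged`, `tokens`, `words`, `tags`, `nn_words`) are illustrative.

```r
# The tagged string from the question, hard-coded so the sketch runs without openNLP
tagged <- "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

# Split on spaces into "word/TAG" tokens
tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]

# Split each token at its last slash, so a word that itself contains "/" stays intact
words <- sub("/[^/]*$", "", tokens)
tags  <- sub("^.*/", "", tokens)

# Keep only the words whose tag is exactly NN
nn_words <- words[tags == "NN"]
print(nn_words)
```

Here `nn_words` holds "sentence"; with several NN words it would hold all of them as a character vector.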

Hope this helps.