I am trying to run the part-of-speech tagging from the openNLP/NLP packages in parallel. I need the code to work on any OS, so I opted for parLapply from the parallel package (though I'm open to other OS-independent options). In the past I ran the tagPOS function from the openNLP package through parLapply without problems. However, recent changes to the openNLP package removed tagPOS and added some more flexible options. Kurt was kind enough to help me recreate the tagPOS functionality from the new package's tools. I can get the lapply version to work, but not the parallel version. It keeps saying the nodes need more variables passed to them, until it finally asks for a non-exported function from openNLP. It seems odd that it keeps asking for more and more variables, which tells me I'm setting up parLapply incorrectly. How can I set up tagPOS to run in a parallel, OS-independent fashion?
library(openNLP)
library(NLP)
library(parallel)
## POS tagger
tagPOS <- function(x, pos_tag_annotator, ...) {
s <- as.String(x)
## Need sentence and word token annotations.
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, pos_tag_annotator, a2)
## Determine the distribution of POS tags for word tokens.
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
## Extract token/POS pairs (all of them): easy.
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)
} ## End of tagPOS function
## Set up a parallel run
text.var <- c("I like it.", "This is outstanding soup!",
"I really must get the recipe.")
ntv <- length(text.var)
PTA <- Maxent_POS_Tag_Annotator()
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterExport(cl=cl, varlist=c("text.var", "ntv",
"tagPOS", "PTA", "as.String", "Maxent_Word_Token_Annotator"),
envir = environment())
m <- parLapply(cl, seq_len(ntv), function(i) {
x <- tagPOS(text.var[i], PTA)
return(x)
}
)
stopCluster(cl)
## Error in checkForRemoteErrors(val) :
## 3 nodes produced errors; first error: could not find function
## "Maxent_Simple_Word_Tokenizer"
openNLP::Maxent_Simple_Word_Tokenizer
## >openNLP::Maxent_Simple_Word_Tokenizer
## Error: 'Maxent_Simple_Word_Tokenizer' is not an exported
## object from 'namespace:openNLP'
## It's a non-exported function
openNLP:::Maxent_Simple_Word_Tokenizer
## Demo that it works with lapply
lapply(seq_len(ntv), function(i) {
tagPOS(text.var[i], PTA)
})
lapply(text.var, function(x) {
tagPOS(x, PTA)
})
## > lapply(seq_len(ntv), function(i) {
## + tagPOS(text.var[i], PTA)
## + })
## [[1]]
## [[1]]$POStagged
## [1] "I/PRP like/IN it/PRP ./."
##
## [[1]]$POStags
## [1] "PRP" "IN" "PRP" "."
##
## [[1]]$word.count
## [1] 3
##
##
## [[2]]
## [[2]]$POStagged
## [1] "THis/DT is/VBZ outstanding/JJ soup/NN !/."
##
## [[2]]$POStags
## [1] "DT" "VBZ" "JJ" "NN" "."
##
## [[2]]$word.count
## [1] 4
##
##
## [[3]]
## [[3]]$POStagged
## [1] "I/PRP really/RB must/MD get/VB the/DT recip/NN ./."
##
## [[3]]$POStags
## [1] "PRP" "RB" "MD" "VB" "DT" "NN" "."
##
## [[3]]$word.count
## [1] 6
EDIT: Per Steve's suggestion.
Note that openNLP is brand new. I installed version 0.2-1 from the tar.gz on CRAN (see the packageDescription output below). Even though the function exists, I still get the following error.
library(openNLP); library(NLP); library(parallel)
tagPOS <- function(text.var, pos_tag_annotator, ...) {
s <- as.String(text.var)
## Set up the POS annotator if missing (for parallel)
if (missing(pos_tag_annotator)) {
PTA <- Maxent_POS_Tag_Annotator()
}
## Need sentence and word token annotations.
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, PTA, a2)
## Determine the distribution of POS tags for word tokens.
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, "[[", "POS"))
## Extract token/POS pairs (all of them): easy.
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
"I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {library(openNLP); library(NLP)})
m <- parLapply(cl, text.var, tagPOS)
## > m <- parLapply(cl, text.var, tagPOS)
## Error in checkForRemoteErrors(val) :
## 3 nodes produced errors; first error: could not find function "Maxent_POS_Tag_Annotator"
stopCluster(cl)
> packageDescription('openNLP')
Package: openNLP
Encoding: UTF-8
Version: 0.2-1
Title: Apache OpenNLP Tools Interface
Authors@R: person("Kurt", "Hornik", role = c("aut", "cre"), email =
"Kurt.Hornik@R-project.org")
Description: An interface to the Apache OpenNLP tools (version 1.5.3). The Apache OpenNLP
library is a machine learning based toolkit for the processing of natural language
text written in Java. It supports the most common NLP tasks, such as tokenization,
sentence segmentation, part-of-speech tagging, named entity extraction, chunking,
parsing, and coreference resolution. See http://opennlp.apache.org/ for more
information.
Imports: NLP (>= 0.1-0), openNLPdata (>= 1.5.3-1), rJava (>= 0.6-3)
SystemRequirements: Java (>= 5.0)
License: GPL-3
Packaged: 2013-08-20 13:23:54 UTC; hornik
Author: Kurt Hornik [aut, cre]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-08-20 15:41:22
Built: R 3.0.1; ; 2013-08-20 13:48:47 UTC; windows
Since you're calling functions from NLP on the cluster workers, you should load it on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:
clusterEvalQ(cl, {library(openNLP); library(NLP)})
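The clusterCall form mentioned above would look roughly like this (a sketch; the anonymous function runs once on each worker, and the trailing NULL just keeps the attached-package list from being shipped back to the master):
clusterCall(cl, function() {
    library(openNLP)
    library(NLP)
    NULL  ## nothing useful to return; we only want the packages attached
})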
Since as.String and Maxent_Word_Token_Annotator are in those packages, they shouldn't be exported once the packages are attached on the workers.
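In other words, the clusterExport call from the question could be trimmed to just the objects created in your own session, something like (a sketch; PTA is left out on purpose, for the reason described next):
clusterExport(cl, varlist = c("text.var", "ntv", "tagPOS"))
## PTA is deliberately not exported; it is recreated on the workers below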
Note that while running your example on my machine, I noticed that the PTA object didn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized. After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:
library(parallel)
tagPOS <- function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, PTA, a2)
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
"I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {
library(openNLP)
library(NLP)
PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)
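As a quick check (a sketch, assuming the m returned above), the tagged strings alone can be pulled out of the result list with:
sapply(m, "[[", "POStagged")
## one "token/POS token/POS ..." string per element of text.var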
If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine what package versions you're getting on the workers by executing sessionInfo with clusterEvalQ:
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())
This returns the result of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:
other attached packages:
[1] NLP_0.1-0 openNLP_0.2-1
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
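If you only care about openNLP itself, a narrower check along the same lines (a sketch, reusing packageDescription as shown earlier) would be:
clusterEvalQ(cl, packageDescription("openNLP")$Version)
## returns the openNLP version string reported by each worker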