inf*_*keR 7 xml parallel-processing r
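I am trying to scrape a large number of web pages so that I can analyze them later. Because the number of URLs is huge, I decided to use the parallel package together with XML.
Specifically, I am using the htmlParse() function from XML. It works fine with sapply, but with parSapply it produces empty objects of class HTMLInternalDocument.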
library(parallel)
library(XML)

url1 <- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2 <- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3 <- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"
urls <- c(url1, url2, url3)
# Works
output1 <- sapply(urls, function(x) htmlParse(x))
str(output1[[1]])
#> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]

# Doesn't work
myFunction <- function(x) {
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  stopCluster(cl)
  return(ok)
}
output2 <- myFunction(urls)
str(output2[[1]])
#> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
# empty
Thanks.
ags*_*udy 11
You can use getURIAsynchronous from the RCurl package, which lets the caller specify multiple URIs to download at the same time.
library(RCurl)
library(XML)

get.asynch <- function(urls) {
  txt <- getURIAsynchronous(urls)
  ## this part can be easily parallelized
  ## I am just using lapply here as a first attempt
  lapply(txt, function(x) {
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
}

get.synch <- function(urls) {
  lapply(urls, function(x) {
    doc <- htmlParse(x)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
}
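Here is a benchmark with 100 URLs. The asynchronous version cuts the total time by roughly a factor of 2 (about 22.5 s versus 39.5 s in this single run).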
library(microbenchmark)
uris <- c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100, uris)
microbenchmark(get.asynch(urls), get.synch(urls), times = 1)
#> Unit: seconds
#>              expr      min       lq   median       uq      max neval
#>  get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
#>   get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
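
The comment in get.asynch says the parsing step can be parallelized. Here is a minimal sketch of one way to do that, not a tested part of the original answer. It assumes the pages are already downloaded as a character vector (as getURIAsynchronous returns); the names extract.one and parse.parallel, the two-worker default, and the inline sample HTML are all made up for illustration. Each worker parses and extracts, then returns only plain character data. That matters because a parsed document is an external pointer into C-level memory, which does not survive being sent back from another process. This is the likely reason the parSapply version in the question comes back empty.

```r
library(parallel)
library(XML)

# Hypothetical worker function: parse one HTML string and return the
# matched node values as a plain character vector.
extract.one <- function(x, xpath) {
  doc <- htmlParse(x, asText = TRUE)
  res <- xpathSApply(doc, xpath, xmlValue)
  free(doc)  # release the C-level document on the worker
  res
}

# Hypothetical helper: run extract.one over all pages on a local cluster.
parse.parallel <- function(txt, xpath, cores = 2) {
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))        # always shut the workers down
  clusterEvalQ(cl, library(XML))  # load XML on every worker
  parLapply(cl, txt, extract.one, xpath = xpath)
}

# Stand-in for the result of getURIAsynchronous(urls)
txt <- c("<html><body><h2>a</h2><h2>first</h2></body></html>",
         "<html><body><h2>b</h2><h2>second</h2></body></html>")
res <- parse.parallel(txt, "/html/body/h2[2]")
unlist(res)
```

For small pages the cost of starting the cluster can outweigh the gain, so it is worth benchmarking this against the plain lapply version before adopting it.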