
Using parallelisation to scrape web pages with R

I am trying to scrape a large number of web pages so I can analyse them later. Since the number of URLs is huge, I decided to use the parallel package together with XML.

Specifically, I am using the htmlParse() function from XML. It works fine with sapply, but when used with parSapply it produces empty objects of class HTMLInternalDocument.

library(parallel)
library(XML)

url1 <- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2 <- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3 <- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"

myFunction <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  return(ok)
}

urls <- c(url1, url2, url3)

# Works
output1 <- sapply(urls, function(x) htmlParse(x))
str(output1[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]

# Doesn't work
myFunction <- function(x){
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  stopCluster(cl)
  return(ok)
}

output2 <- myFunction(urls)
str(output2[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
# empty
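A likely explanation (my assumption, based on how R's parallel workers return results): htmlParse() returns an external pointer to a C-level document, and external pointers do not survive serialisation from a worker process back to the master, so the master receives a pointer to nothing. A workaround sketch under that assumption: download the raw HTML in parallel, since character vectors serialise cleanly, then parse serially on the master. The helper name fetchAndParse and the use of readLines() with asText = TRUE are illustrative choices, not from the original post.

```r
library(parallel)
library(XML)

# Sketch: ship plain character data across workers, parse on the master,
# where the external pointers created by htmlParse() remain valid.
fetchAndParse <- function(urls) {
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  on.exit(stopCluster(cl))  # always release the cluster, even on error
  # readLines() returns a character vector -- safe to serialise back
  pages <- parSapply(cl, urls, function(u)
    paste(readLines(u, warn = FALSE), collapse = "\n"))
  # parse serially on the master process
  lapply(pages, htmlParse, asText = TRUE)
}
```

With this split, the expensive network I/O is still parallelised, while the non-serialisable parsing step stays in a single process.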

Thanks.

xml parallel-processing r

7 votes · 1 answer · 1034 views
