Web scraping with R

CKr*_*Kre 3 r web-crawler web-scraping

I'm running into some problems scraping data from a website. First of all, I don't have much experience with web scraping... My plan is to use R to scrape some data from the following website: http://spiderbook.com/company/17495/details?rel=300795

In particular, I want to extract the links to the articles on this site.

My idea so far:

library(XML)
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, xmlValue)  # text content of each div
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))

But this doesn't extract the expected information. Any help here would be really appreciated! Thanks!

Best, Christoph

jlh*_*ard 8

You've picked a tough problem to learn web scraping on.

This site uses javascript to load the article information. In other words, the link loads a set of scripts which run when the page loads to grab the information (probably from a database) and insert it into the DOM. htmlParse(...) just grabs the base html and parses it, so the links you want are simply not there.
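You can check this for yourself: a minimal sketch (using the XML package, and the xPath for the doclink anchors identified below) parses the static html and queries for the article links, which come back empty:

library(XML)
# parse only the static html returned by the server; no javascript is run
static <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
# the article anchors are inserted later by scripts, so this finds nothing
xpathSApply(static, '//a[@class="doclink"]/@href')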

AFAIK the only workaround is to use the RSelenium package. This package essentially lets you pass the base html through what amounts to a browser simulator, which does run the scripts. The problem with RSelenium is that you need to download not just the package but also a "Selenium Server". This link has a nice introduction to RSelenium.

Once you've done that, inspecting the source in a browser shows that the article links are all in the href attribute of anchor tags with class=doclink. This is straightforward to extract using xPath. NEVER, EVER use regular expressions to parse XML.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer()        # download Selenium Server, if not already present
startServer()           # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open()            # open connection
remDr$navigate(url)     # grab and process the page (including scripts)
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"                                                                                    
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
# [7] "http://www.calcharge.org/2014/07/"                                                                                    
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"


jdh*_*son 6

As @jlhoward notes, RSelenium will solve this problem, and it doesn't require inspecting the underlying site's network traffic or dissecting the site to find the appropriate calls. I would also note that RSelenium can run without Selenium Server if phantomjs is installed on the user's system; in this case phantomjs can be driven directly. There is a vignette on headless browsing with RSelenium at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
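For example, a minimal sketch of direct driving (assuming phantomjs is installed and on the system PATH) reruns @jlhoward's extraction with no Selenium Server involved:

library(RSelenium)
library(XML)
pJS <- phantom()                             # start phantomjs itself; no Selenium Server
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
remDr$navigate("http://spiderbook.com/company/17495/details?rel=300795")
doc <- htmlParse(remDr$getPageSource()[[1]]) # scripts have run, so the links are in the DOM
links <- as.character(doc['//a[@class="doclink"]/@href'])
pJS$stop()                                   # shut phantomjs down again
links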

Inspecting web traffic with a browser

In this case, inspecting the traffic shows that a json file is called at http://spiderbook.com/company/details/docs?rel=300795&docs_page=0, and it is not protected by cookies or sensitive to user-agent strings etc. In that case the following can be done:

library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"                                                                                    
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
[7] "http://www.calcharge.org/2014/07/"                                                                                    
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"      
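The docs_page=0 parameter suggests the endpoint is paginated. As a sketch, one could walk the pages and collect every link; the stopping rule (an empty data field marking the last page) is an assumption on my part, not something verified against the site:

base <- "http://spiderbook.com/company/details/docs?rel=300795&docs_page="
urls <- character()
page <- 0
repeat {
  res <- fromJSON(paste0(base, page))
  if (length(res$data) == 0) break  # assumed: an empty page signals the end
  urls <- c(urls, sapply(res$data, "[[", "url"))
  page <- page + 1
}
urls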

Inspecting web traffic by writing a simple function for phantomJS

With RSelenium driving phantomJS we can also use it to inspect the dynamic traffic (currently only when driving phantomJS directly). As a simple example, we record the calls requested and received by the current page we are viewing, storing them in "traffic.txt" in our current working directory:

library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
             var fs = require(\"fs\");
             fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
             page.onResourceRequested = function(request) {
                fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
             };
             page.onResourceReceived = function(response) {
                fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
             };"

result <- remDr$phantomExecute(psScript)

remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"                                                        
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"      
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"      
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"        
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"          

> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

pJS$stop() # stop phantomJS

Here we can see that one of the files received was "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0".

Inspecting traffic using the built-in HAR support of phantomJS/ghostdriver

In fact, phantomJS/ghostdriver creates its own HAR file as we browse pages, so just by driving phantomJS we have access to all the request/response data:

library(RSelenium)
library(RJSONIO)  # needed for fromJSON below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
# the HAR logs contain a lot of detail; here we just illustrate accessing the data
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))

> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

So this final example shows interrogating the HAR file produced by phantomJS to find the XMLHttpRequest requests and then returning the specific urls, which as we'd hope correspond to the json file we found at the beginning of this answer.
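To close the loop, a short sketch (reusing harLogs and XHRIndex from the block above) feeds the discovered XHR url back into the json approach from the start of this answer:

xhrUrl <- harLogs$log$entries[XHRIndex][[1]]$request$url
res <- fromJSON(xhrUrl)            # fetch and parse the json endpoint
sapply(res$data, "[[", "url")      # yields the same article links as before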