I'm running into some problems scraping data from a website. First of all, I don't have much experience with web scraping... My plan is to use R to get some data from the following site: http://spiderbook.com/company/17495/details?rel=300795
In particular, I want to extract the links to the articles on this site.
My idea so far:
library(XML)
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, xmlValue)  # text content of each div
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))
But this doesn't pull out the expected information. Any help here would be really appreciated! Thanks!
Best, Christoph
You've picked a tough problem to learn on.

This site uses javascript to load the article information. In other words, the link loads a set of scripts which run when the page loads, grab the information (probably from a database), and insert it into the DOM. htmlParse(...) just grabs the base html and parses that, so the links you want are simply not there.
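You can verify this yourself by parsing the raw page and searching for the article links (which, as discussed below, live in anchor tags with class=doclink). A minimal check, assuming only the XML package:

library(XML)
raw <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
# the article links live in <a class="doclink"> tags (see below), but none are
# present in the base html because they are inserted later by javascript
length(getNodeSet(raw, '//a[@class="doclink"]'))
# expected: 0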
AFAIK the only way around this is to use the RSelenium package. This package essentially allows you to pass the base html through what amounts to a browser simulator, which does run the scripts. The problem with RSelenium is that you need not only to download the package, but also a "Selenium Server". This link has a nice introduction to RSelenium.

Once you've done that, inspecting the source in a browser shows that the article links are all in the href attribute of anchor tags which have class=doclink. This is straightforward to extract using xPath. Never, never, never use regex to parse XML.
library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer() # download Selenium Server, if not already present
startServer() # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open() # open connection
remDr$navigate(url) # grab and process the page (including scripts)
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
# [7] "http://www.calcharge.org/2014/07/"
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
As @jihoward notes, RSelenium will solve this problem, and inspecting the underlying site's network traffic to dissect the appropriate calls is not required. I would add that RSelenium can be run without Selenium Server if phantomjs is installed on the user's system; in that case phantomjs can be driven directly, as the examples below show. A vignette on headless browsing with RSelenium is at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
In this case, inspecting the traffic shows that the following json file is called: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0, and that it is not cookie-protected or sensitive to user-agent strings etc. So the following can be done:
library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
[7] "http://www.calcharge.org/2014/07/"
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
With RSelenium and phantomJS we can also inspect the dynamic traffic as it happens (currently only when driving phantomJS directly). As a simple example, here we record the requests made and the responses received for the web page we are viewing, and store them in "traffic.txt" in the current working directory:
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
var fs = require(\"fs\");
fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
page.onResourceRequested = function(request) {
fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
};
page.onResourceReceived = function(response) {
fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
};"
result <- remDr$phantomExecute(psScript)
remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"
> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
pJS$stop() # stop phantomJS
Here we can see that one of the files received was "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0".
In fact, phantomJS/ghostdriver creates its own HAR file, so just by browsing the page while driving phantomJS we have access to all the request/response data:
library(RSelenium)
library(RJSONIO)  # for fromJSON
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
# the HAR contains a lot of detail; here we just illustrate accessing the data
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))
> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
So this last example shows interrogating the HAR file produced by phantomJS to find the XMLHttpRequest requests, and then returning the specific urls, which as hoped correspond to the json file we found at the beginning of this answer.
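Putting the pieces together, the XHR url recovered from the HAR log can be fed straight back into fromJSON to pull out the article links; a minimal sketch, assuming harLogs and XHRIndex from the previous block are still in scope:

library(RJSONIO)
xhrUrl <- harLogs$log$entries[XHRIndex][[1]]$request$url  # json endpoint found above
res <- fromJSON(xhrUrl)
sapply(res$data, "[[", "url")  # the article links, as in the earlier example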