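I'm trying to log in automatically to the UK Data Archive service. The website is clearly trustworthy. Unfortunately, both RCurl and httr break at SSL verification, while my web browser gives no warning at all. I can work around the problem by using ssl.verifypeer = FALSE in RCurl, but I'd like to understand what is going on.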
# breaks
library(httr)
GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )
# breaks
library(RCurl)
cert <- system.file("CurlSSL/cacert.pem", package = "RCurl")
getURL("https://www.esds.ac.uk/secure/UKDSRegister_start.asp",cainfo = cert)
# works
library(RCurl)
getURL(
"https://www.esds.ac.uk/secure/UKDSRegister_start.asp" ,
.opts = list(ssl.verifypeer = FALSE)
) # note: use list(ssl.verifypeer = FALSE,followlocation=TRUE) to see content
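For comparison, the same workaround can probably be expressed in httr as well. This is only a minimal sketch, assuming a recent httr release in which curl options are spelled with underscores (`ssl_verifypeer`); the helper name `get_without_verify` is made up for illustration. Like the RCurl version, it hides the verification failure rather than explaining it, so it is best treated as a diagnostic aid.

```r
library(httr)

# Curl option set that skips peer-certificate verification.
# Assumption: the underscore spelling used by recent httr versions.
no_verify <- config(ssl_verifypeer = 0L)

# Hypothetical helper: perform a GET with verification switched off.
get_without_verify <- function(url) {
  GET(url, no_verify)
}

# Usage (needs network access, so it is left commented out here):
# resp <- get_without_verify("https://www.esds.ac.uk/secure/UKDSRegister_start.asp")
# status_code(resp)
```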
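I'm running into a problem when trying to download a PDF.
For example, if I have the DOI of a PDF document on the Archaeology Data Service, it resolves to this landing page, which contains an embedded link to this pdf, but that link actually redirects to this other link.
library(httr) will handle resolving the DOI, and we can extract the PDF's URL from the landing page using library(XML), but I'm stuck on getting the PDF itself.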
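The link-extraction step mentioned above could look roughly like the following. This is a sketch only: `landing_html` is a hypothetical stand-in for the landing page (a real page would first be fetched, for example with httr), and the XPath simply collects every `href` and keeps those ending in `.pdf`.

```r
library(XML)

# Hypothetical stand-in for the landing page markup.
landing_html <- '<html><body>
  <a href="/about">About</a>
  <a href="/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf">PDF</a>
</body></html>'

# Parse the HTML text and pull the href attribute of every anchor.
doc <- htmlParse(landing_html, asText = TRUE)
hrefs <- xpathSApply(doc, "//a/@href")

# Keep only the links that end in ".pdf", dropping the attribute names.
pdf_links <- as.character(hrefs[grepl("\\.pdf$", hrefs)])
print(pdf_links)
```

The extracted link is relative here, so it would still need to be joined to the site's base URL before being requested.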
If I do this:
download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")
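then I receive an HTML file identical to http://archaeologydataservice.ac.uk/myads/
Trying the answer to "How to use R to download a zipped file from a SSL page that requires cookies" leads me to this: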
library(httr)
terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")
# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)
GET(download, query = values)
# Actually download the file (this will take a while)
resp <- GET(download, query = values)
# write the content …