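SSL verification causes RCurl and httr to break - on a website that should be legit

I'm trying to automate the login to the UK's data archive service. That website is obviously trustworthy. Unfortunately, both RCurl and httr break at SSL verification. My web browser doesn't give any warning. I can work around the issue by using ssl.verifypeer = FALSE in RCurl, but I'd like to understand what's going on.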

# breaks
library(httr)
GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )

# breaks
library(RCurl)
cert <- system.file("CurlSSL/cacert.pem", package = "RCurl")
getURL("https://www.esds.ac.uk/secure/UKDSRegister_start.asp",cainfo = cert)

# works
library(RCurl)
getURL(
    "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" , 
    .opts = list(ssl.verifypeer = FALSE)
) # note: use list(ssl.verifypeer = FALSE,followlocation=TRUE) to see content
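To see what the SSL failure actually says before reaching for ssl.verifypeer = FALSE, one option is to capture the error message instead of letting the call abort. The sketch below is a minimal, hypothetical helper (capture_request_error is not part of httr or RCurl); it only wraps a request in tryCatch and does not by itself explain why verification fails on this site.

```r
# Hypothetical helper (an assumption, not from the question):
# run a request function and return either its result or the error text.
capture_request_error <- function(request_fn) {
  tryCatch(
    list(ok = TRUE, value = request_fn(), message = NA_character_),
    error = function(e) {
      list(ok = FALSE, value = NULL, message = conditionMessage(e))
    }
  )
}

# Usage (needs network access and the httr package):
# res <- capture_request_error(function() {
#   httr::GET("https://www.esds.ac.uk/secure/UKDSRegister_start.asp")
# })
# if (!res$ok) cat("Request failed:", res$message, "\n")
```

The returned message is whatever the underlying library reports. It may help narrow down the cause, but which cause applies to this site is not established by the question.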

ssl curl r rcurl httr

10 votes · 1 answer · 10k views

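Using R to accept cookies to download a PDF file

I'm running into problems when trying to download a PDF.

For example, if I have the DOI for a PDF document on the Archaeology Data Service, it resolves to this landing page, which contains an embedded link to this PDF, but that link really redirects to this other link.

library(httr) handles resolving the DOI, and we can extract the PDF URL from the landing page using library(XML), but I'm stuck at getting the PDF itself.

If I do this: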

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

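then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/

Trying the answer to "How to use R to download a zipped file from a SSL page that requires cookies" leads me to this: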
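A quick way to confirm that a download produced HTML rather than a PDF is to look at the first bytes of the file, since PDF files begin with the signature "%PDF-". The following is a minimal base R sketch; the function name is_pdf_file is hypothetical and was chosen only for illustration.

```r
# Hypothetical helper (an assumption, not from the question):
# TRUE when the file starts with the PDF signature "%PDF-".
is_pdf_file <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Usage after the download.file() call above:
# is_pdf_file("tmp.pdf")
```

If this returns FALSE for tmp.pdf, the server most likely sent a web page (such as the terms page) in place of the document.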

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content …
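The snippet above is cut off in this listing, and the following does not reconstruct it. As a separate, hedged sketch of the same idea, the terms form and the download can be sent through one explicit httr handle, which makes it visible that both requests share a cookie store. The function name download_after_accepting_terms and the choice of encode = "form" are assumptions; whether the site accepts this exact sequence is not confirmed here.

```r
# Hypothetical helper, not the original poster's code.
# `values` is a named list such as the one defined above;
# `destfile` is the path to write the response body to.
download_after_accepting_terms <- function(values, destfile) {
  # One explicit handle, so both requests use the same cookie store.
  h <- httr::handle("http://archaeologydataservice.ac.uk")

  # Submit the terms form; cookies set by the server stay on the handle.
  # encode = "form" is an assumption about what the site expects.
  httr::POST(handle = h, path = "myads/copyrights",
             body = values, encode = "form")

  # Request the file on the same handle, so those cookies are sent back.
  resp <- httr::GET(handle = h, path = "archiveDS/archiveDownload",
                    query = values)
  httr::stop_for_status(resp)

  # Write the raw response body to disk and return the response invisibly.
  writeBin(httr::content(resp, as = "raw"), destfile)
  invisible(resp)
}

# Usage (needs network access and the httr package):
# download_after_accepting_terms(values, "tmp.pdf")
```

Checking the first bytes of the written file afterwards shows whether a PDF or another HTML page came back.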

curl r web-scraping httr

8 votes · 1 answer · 1214 views
