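Below is a script that reproduces a problem I ran into while building a crawler with RCurl that performs concurrent requests. The goal is to download the content of thousands of websites for statistical analysis, so the solution needs to scale.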
library(RCurl)
library(httr)
uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com",
"p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com",
"mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com",
"xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar",
"android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")
### RCurl Concurrent requests
getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE){
content = list()
curls = list()
for(i in uris) {
curl = getCurlHandle()
content[[i]] = basicTextGatherer()
opts = curlOptions(URL = i, writefunction = content[[i]]$update,
timeout = 2, maxredirs = 3, verbose = TRUE,
followLocation = TRUE,...)
curlSetOpt(.opts = opts, curl = curl)
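multiHandle = push(multiHandle, …

I need to download information from a website that is protected with cookies. I pass this protection manually and then insert the cookie into httr.
This is a similar topic, but it does not solve my problem: (Copying cookies for httr)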
library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"
cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
response <- GET(url, config(cookie= cook))
content(x = response,as = 'text', encoding = "UTF-8")
So when I use content, it returns the not-logged-in information (just as it would without the cookie).
How can I solve this?
(Test credentials are login: mytest2, pass: qwerty12)

I am trying to use Azure Storage through its REST API in R. I am using the httr package, which wraps curl.
You can use R-fiddle: http://www.r-fiddle.org/#/fiddle?id=vh8uqGmM
library(httr)
requestdate<-format(Sys.time(),"%a, %d %b %Y %H:%M:%S GMT")
url<-"https://preconstuff.blob.core.windows.net/pings?restype=container&comp=list"
sak<-"Q8HvUVJLBJK+wkrIEG6LlsfFo19iDjneTwJxX/KXSnUCtTjgyyhYnH/5azeqa1bluGD94EcPcSRyBy2W2A/fHQ=="
signaturestring<-paste0("GET",paste(rep("\n",12),collapse=""),
"x-ms-date:",requestdate,"
x-ms-version:2009-09-19
/preconstuff/pings
comp:list
restype:container")
headerstuff<-add_headers(Authorization=paste0("SharedKey preconstuff:",
RCurl::base64(digest::hmac(key=sak,
object=enc2utf8(signaturestring),
algo= "sha256"))),
`x-ms-date`=requestdate,
`x-ms-version`= "2009-09-19")
Trying to list the blobs:
content(GET(url,config = headerstuff, verbose() ))
The MAC signature found in the HTTP request 'Q8HvUVJLBJK+wkrIEG6LlsfFo19iDjneTwJxX/KXSnUCtTjgyyhYnH/5azeqa1bluGD94EcPcSRyBy2W2A/fHQ==' is not the same as any computed signature.
[1] "<?xml version=\"1.0\" encoding=\"utf-8\"?><Error>
<Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\nRequestId:1ab26da5-0001-00dc-6ddb-15e35c000000\nTime:2015-03-26T17:51:42.7190620Z</Message>
<AuthenticationErrorDetail>The MAC signature found in …

I am trying to convert JSON pulled from an API into a data frame in R so that I can use and analyze the data.
# Load the needed packages
require(RJSONIO)
require(httr)
#request a list of companies currently fundraising using httr
r <- GET("https://api.angel.co/1/startups?filter=raising")
#convert to text object using httr
raise <- content(r, as="text")
#convert to list using RJSONIO
fromJSON(raise) -> new
Once I have this object, new, I have a hard time parsing the list into a data frame. The JSON has this structure:
{
"startups": [
{
"id": 6702,
"name": "AngelList",
"quality": 10,
"...": "...",
"fundraising": {
"round_opened_at": "2013-07-30",
"raising_amount": 1000000,
"pre_money_valuation": 2000000,
"discount": null,
"equity_basis": "equity",
"updated_at": "2013-07-30T08:14:40Z",
"raised_amount": 0.0
}
}
],
"total": 4268 ,
"per_page": 50,
"page": 1,
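"last_page": …

I need to access the same web page with different "keys" to get the specific content each one provides.
I have a list of keys, x, and I use the GET command from the httr package to access the web page and then retrieve the information I need, y.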
library(httr)
library(stringr)
library(XML)
for (i in 1:20){
h1 = GET ( paste0("http:....categories=&query=", x[i]),timeout(10))
par = htmlParse(file = h1)
y[i]=xpathSApply(doc = par, path = "//h3/a" , fun=xmlValue)
}
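The problem is that the timeout is reached often, and it breaks the loop.
So if the timeout is reached, I would like to refresh the page or retry the GET command, because I suspect the problem lies with the internet connection of the website I am trying to access.
The way my code works now, a timeout breaks the loop. I need to either ignore the error and move on to the next iteration, or retry accessing the site.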
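The retry-or-skip behaviour asked for here can be sketched in base R with tryCatch. This is only a minimal sketch: `with_retry` and `flaky` are hypothetical names introduced for illustration, and `flaky` stands in for the GET call so that the example runs without a network connection.

```r
# Hypothetical helper: call f() up to `times` times.
# Returns the first successful result, or NULL if every attempt fails.
with_retry <- function(f, times = 3, pause = 0) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) {
      message("attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    if (attempt < times) Sys.sleep(pause)  # wait before the next attempt
  }
  NULL
}

# Stand-in for the GET call: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "ok"
}

res <- with_retry(flaky, times = 5)
```

Inside the loop from the question, the request could be wrapped as `h1 <- with_retry(function() GET(url, timeout(10)))`, followed by `if (is.null(h1)) next` to skip a key that never responds. httr also provides a RETRY() function that may cover the same need.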
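
I would like to know whether there is a way to compare distances between airports (by IATA code). There are some scripts for this, but none use R. So I tried using an API:
Sample data: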
library(curl) # for curl post
departure <- c("DRS","TXL","STR","DUS","LEJ","FKB","LNZ")
arrival <- c("FKB","HER","BOJ","FUE","PMI","AYT","FUE")
flyID <- c(1,2,3,4,5,6,7)
df <- data.frame(departure,arrival,flyID)
departure arrival flyID
1 DRS FKB 1
2 TXL HER 2
3 STR BOJ 3
4 DUS FUE 4
5 LEJ PMI 5
6 FKB AYT 6
7 LNZ FUE 7
api<- curl_fetch_memory("https://airport.api.aero/airport/distance/DRS/FUE?user_key=d805e84363494ca03b9b52d5a505c4d1")
cat(rawToChar(api$content))
callback({"processingDurationMillis":0,"authorisedAPI":true,"success":true,"airline":null,"errorMessage":null,"distance":"3,416.1","units":"km"})
where DRS is the departure airport and FUE the arrival airport.
So I thought I would loop over df and paste the values into the URL. For an R newbie, however, this seems somewhat difficult.
df$distance<- list(length = nrow(df))
for (i in 1:nrow(df)){
url <- paste0("https://airport.api.aero/airport/distance/", i, "FUE ?user_key=d805e84363494ca03b9b52d5a505c4d1")
myData[[i]] <- read.table(url, header=T,sep="|")
} …

My package's DESCRIPTION file has httr in the Imports directive:
Imports:
httr (>= 1.1.0),
jsonlite,
rstudioapi
httr exports the S3 method length.path.
S3method(length,path)
It is defined as:
#' @export
length.path <- function(x) file.info(x)$size
In my package I have some objects to which I assign the class "path". Every time I assign the class "path" to any object, whether or not I call length() on it, this is printed to stdout:
Error in file.info(x) : invalid filename argument
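The dispatch behaviour described above can be reproduced in miniature. This is a sketch with a made-up class name, "demo_path", and a dummy method body, not httr's code. It only illustrates that length() is an internal generic, so any visible method named length.<class> is called for objects carrying that class attribute.

```r
# Hypothetical class "demo_path", chosen so the sketch does not touch
# httr's own "path" class. The method body is a dummy value.
length.demo_path <- function(x) 99L

obj <- structure("some/file.txt", class = "demo_path")

len_classed <- length(obj)          # dispatches to length.demo_path()
len_plain <- length(unclass(obj))   # default length once the class is removed
```

By the same mechanism, an object given the class "path" while httr is loaded would reach httr's length.path and therefore file.info(). Choosing a class name other than "path" would presumably avoid the clash, though that is an inference from this sketch rather than something the question confirms.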
Here is some reproducible code that anyone can run:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.1
> thing = …

httr::GET preserves cookies when making requests to the same website.
Example:
# Get login cookie
r1 <- GET("https://some.url/login", authenticate("foo", "bar"))
cookies(r1)
# returns a data frame of two cookies
# Make request that requires authentication cookie
# Only succeeds if r1 was made
r2 <- GET("https://some.url/data/?query&subset=1")
r2
Note that when making r2 you do not have to pass any cookie information explicitly, because the cookies are stored somewhere automatically.
How can I query or delete these stored cookies?
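One place to look is httr's handle pool, which keeps one curl handle per scheme and host. The sketch below assumes, to the best of my knowledge, that cookies() accepts a handle object as well as a response. The URL is the placeholder from the question and no request is sent.

```r
library(httr)

# Look up (or create) the pooled handle that httr reuses for this host.
h <- handle_find("https://some.url")

# Cookies currently stored on that handle (none here, as nothing was requested).
stored <- cookies(h)

# Drop the pooled handle, and with it any cookies it holds.
handle_reset("https://some.url")
```

After handle_reset(), the next request to the same host should start from a fresh handle without the earlier cookies.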

I am trying to simply replicate the example for rvest::html_nodes(), but I get an error:
library(rvest)
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
html_nodes(ateam, "center")
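Error in do.call(method, list(parsed_selector)) : could not find function "xpath_element"
The same happens if I load packages such as httr, xml2 and selectr. I also seem to have the latest versions of these packages...
In which package are functions such as xpath_element and xpath_combinedselector located? How do I get this to work? Note that I am running Ubuntu 16.04, so the code may well work on other platforms...

The httr package provides a curl wrapper in R (see the package documentation).
I am new to HTTP and APIs. My trouble is getting OAuth 2.0 authentication to work. I have tried various syntax specifications and get either an error or status 401.
What is the correct way to use an OAuth 2.0 token and make a GET() request with httr?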
# Set UP
url = "https://canvas.{institution}.edu/api/v1/courses"
key = "{secret_key}"
# 1
GET(url, sign_oauth2.0(key))
# Error: Deprecated: supply token object to config directly
# 2
GET(url, config(sign_oauth2.0 = key))
# unknown option: sign_oauth2.0
# 3
GET(url, config = list(sign_oauth2.0 = key))
# Status 401
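A fourth variant worth trying is to send the token in an Authorization header. This is a minimal sketch that assumes the API accepts the key as a bearer token, which the question does not confirm. The key is the placeholder from the question, and the request itself is left commented out.

```r
library(httr)

key <- "{secret_key}"  # placeholder, as in the question

# Build the header config; nothing is sent at this point.
auth <- add_headers(Authorization = paste("Bearer", key))

# Not run: send the request with the header attached.
# GET(url, auth)
```

If the server still answers with status 401, the token itself or the scheme it expects would be the next things to check.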