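Problems with an RCurl-based crawler that performs concurrent requests

The following script reproduces the problems I ran into while building a crawler with RCurl that performs concurrent requests. The goal is to download the content of several thousand websites for statistical analysis, so the solution needs to scale.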

library(RCurl)
library(httr)

uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com", 
         "p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com", 
         "mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com", 
         "xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar", 
         "android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")

### RCurl Concurrent requests 

getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE){
  content = list()
  curls = list()
  for(i in uris) {
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update,
                       timeout = 2, maxredirs = 3, verbose = TRUE,
                       followLocation = TRUE,...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, …
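
One thing that stands out in the list above is its last entry, which uses an `fbapp://` scheme rather than HTTP. Whether that entry contributes to the problem is not stated in the question, so the following is only a sketch of one possible pre-filtering step, assuming that entries with a non-HTTP scheme should be set aside before crawling. The names `has_scheme`, `is_http`, `crawlable` and `skipped` are illustrative and not part of the original script.

```r
# Shortened copy of the URI list from the question.
uris <- c("inforapido.com.ar", "lm.facebook.com",
          "fbapp://256002347743983/thread")

# TRUE where the entry starts with an explicit scheme such as "http://".
has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", uris)

# TRUE where that scheme is http or https.
is_http <- grepl("^https?://", uris, ignore.case = TRUE)

# Assumption: bare host names are treated as HTTP targets; anything with a
# non-HTTP scheme is skipped rather than handed to the multi handle.
keep      <- !has_scheme | is_http
crawlable <- uris[keep]
skipped   <- uris[!keep]
```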


r rcurl httr

mar*_*bel

2014-10-08

6
votes

1
answer

7494
views

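How to properly set cookies to get URL content using httr

I need to download information from a website that is protected with cookies. I get past this protection manually and then insert the cookies into httr.

Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)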

library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"

cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"

response <- GET(url, config(cookie= cook))

content(x = response,as = 'text', encoding = "UTF-8")


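So when I call content, it returns the information shown to a user who is not logged in (as if there were no cookies).

How can I solve this problem?

(Test credentials: login: mytest2, pass: qwerty12)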
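
A possible alternative to passing the raw string through `config()` is httr's `set_cookies()`, which takes cookies as name/value pairs. The sketch below splits a browser-style cookie string into a named character vector. Whether this actually resolves the login problem on this particular site is not known. The shortened cookie string and the helper name `get_with_cookies` are illustrative assumptions.

```r
# Shortened version of the cookie string from the question.
cook <- paste0("_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; ",
               "__utmz=29983421.1413489369.1.1.utmcsr=(direct)|",
               "utmccn=(direct)|utmcmd=(none)")

# Split into "name=value" pairs, then split each pair on its FIRST "=" only,
# because values such as __utmz contain further "=" characters.
pairs         <- strsplit(cook, ";\\s*")[[1]]
cookie_names  <- sub("=.*$", "", pairs)
cookie_values <- sub("^[^=]*=", "", pairs)
cookies       <- setNames(cookie_values, cookie_names)

# Hypothetical helper: sends the request with the parsed cookies.
# Requires the httr package; it is defined here but not called.
get_with_cookies <- function(url, cookies) {
  httr::GET(url, httr::set_cookies(.cookies = cookies))
}
```

A call would then look like `response <- get_with_cookies(url, cookies)`, followed by the same `content()` call as in the question.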

cookies r httr

Vad*_*ymB

2017-05-23

6
votes

1
answer

5296
views

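Error connecting to the Azure Blob Storage API from R

I am trying to use Azure Storage through the REST API in R. I am using the httr package, which wraps curl.

Setup

You can use R-fiddle: http://www.r-fiddle.org/#/fiddle?id=vh8uqGmM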

library(httr)
requestdate<-format(Sys.time(),"%a, %d %b %Y %H:%M:%S GMT")
url<-"https://preconstuff.blob.core.windows.net/pings?restype=container&comp=list"
sak<-"Q8HvUVJLBJK+wkrIEG6LlsfFo19iDjneTwJxX/KXSnUCtTjgyyhYnH/5azeqa1bluGD94EcPcSRyBy2W2A/fHQ=="
signaturestring<-paste0("GET",paste(rep("\n",12),collapse=""),
"x-ms-date:",requestdate,"
x-ms-version:2009-09-19
/preconstuff/pings
comp:list
restype:container")

headerstuff<-add_headers(Authorization=paste0("SharedKey preconstuff:",
                         RCurl::base64(digest::hmac(key=sak,
                         object=enc2utf8(signaturestring),
                         algo= "sha256"))),
                    `x-ms-date`=requestdate,
                    `x-ms-version`= "2009-09-19")


Trying to list blobs:

content(GET(url,config = headerstuff, verbose() ))


Error

Top-level message

The MAC signature found in the HTTP request 'Q8HvUVJLBJK+wkrIEG6LlsfFo19iDjneTwJxX/KXSnUCtTjgyyhYnH/5azeqa1bluGD94EcPcSRyBy2W2A/fHQ==' is not the same as any computed signature.

Response content

[1] "<?xml version=\"1.0\" encoding=\"utf-8\"?><Error>
<Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\nRequestId:1ab26da5-0001-00dc-6ddb-15e35c000000\nTime:2015-03-26T17:51:42.7190620Z</Message>
<AuthenticationErrorDetail>The MAC signature found in …
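
One detail worth checking, offered as an assumption rather than a diagnosis of this exact request: Azure's Shared Key scheme computes the HMAC-SHA256 with the Base64-decoded account key and Base64-encodes the raw digest bytes, whereas the setup code passes the key as text and encodes `digest::hmac`'s default hexadecimal output. A minimal sketch of that signing step follows. The helper name `sign_shared_key` and the dummy key are hypothetical, and the string-to-sign itself is not verified here.

```r
# Hypothetical helper: signs `string_to_sign` with a Base64-encoded key.
# Requires the digest and RCurl packages, as used in the question.
sign_shared_key <- function(string_to_sign, account_key_b64) {
  # Decode the Base64 account key into raw bytes.
  key_raw <- RCurl::base64Decode(account_key_b64, mode = "raw")
  # raw = TRUE returns the digest as bytes instead of a hex string.
  mac <- digest::hmac(key = key_raw,
                      object = enc2utf8(string_to_sign),
                      algo = "sha256",
                      raw = TRUE)
  # Base64-encode the 32 raw digest bytes.
  as.character(RCurl::base64Encode(mac))
}

# Dummy key for illustration only: "a2V5" is the Base64 encoding of "key".
# sign_shared_key("GET\n...", "a2V5")
```

The result would replace the `RCurl::base64(digest::hmac(...))` expression inside the `Authorization` header.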


r azure-storage httr

Ste*_*cke

2017-05-23

6
votes

1
answer

3271
views

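Successfully coercing a paginated JSON object to an R data frame

I am trying to convert JSON pulled from an API into a data frame in R so that I can work with and analyze the data.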

# Load the needed packages (they must already be installed)
library(RJSONIO)
library(httr)

#request a list of companies currently fundraising using httr
r <- GET("https://api.angel.co/1/startups?filter=raising")
#convert to text object using httr
raise <- content(r, as="text")
#convert to list using RJSONIO
fromJSON(raise) -> new


Once I have this object, new, I have a hard time parsing the list into a data frame. The JSON has this structure:

{
  "startups": [
 {
  "id": 6702,
  "name": "AngelList",
  "quality": 10,
  "...": "...",
  "fundraising": {
    "round_opened_at": "2013-07-30",
    "raising_amount": 1000000,
    "pre_money_valuation": 2000000,
    "discount": null,
    "equity_basis": "equity",
    "updated_at": "2013-07-30T08:14:40Z",
    "raised_amount": 0.0
      }
    }
  ],
  "total": 4268 ,
  "per_page": 50,
  "page": 1,
  "last_page": …
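
As a rough illustration of one way to flatten such a structure, the sketch below builds a small list shaped like the parsed `startups` array and turns each record into one data frame row, mapping JSON `null` to `NA`. The second record and the helper names `null_to_na` and `flatten_startup` are hypothetical. Only a few of the fields are carried over, and pagination is not handled. `jsonlite::fromJSON()` also has a `flatten` argument that may cover much of this.

```r
# Stand-in for the parsed "startups" array; the second record is made up.
startups <- list(
  list(id = 6702, name = "AngelList", quality = 10,
       fundraising = list(raising_amount = 1000000, discount = NULL,
                          equity_basis = "equity")),
  list(id = 1, name = "ExampleCo", quality = 5,
       fundraising = list(raising_amount = 250000, discount = 20,
                          equity_basis = "debt"))
)

# JSON null arrives as NULL, which cannot be stored in a data frame cell.
null_to_na <- function(x) if (is.null(x)) NA else x

# Turn one startup record into a one-row data frame.
flatten_startup <- function(s) {
  f <- s$fundraising
  data.frame(
    id             = null_to_na(s$id),
    name           = null_to_na(s$name),
    quality        = null_to_na(s$quality),
    raising_amount = null_to_na(f$raising_amount),
    discount       = null_to_na(f$discount),
    equity_basis   = null_to_na(f$equity_basis),
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data frames into a single data frame.
startups_df <- do.call(rbind, lapply(startups, flatten_startup))
```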


r httr jsonlite

ver*_*his

2015-05-03

6
votes

2
answers

3509
views

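How to refresh or retry a specific web page using the httr GET command?

I need to access the same web page with different "keys" to get the specific content it provides.

I have a list of keys, x, and I use the GET command from the httr package to access the web page and then retrieve the information I need, y.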

library(httr)
library(stringr)
library(XML)

for (i in 1:20){
    h1 = GET ( paste0("http:....categories=&query=", x[i]),timeout(10))
    par = htmlParse(file = h1)

    y[i]=xpathSApply(doc = par, path = "//h3/a" , fun=xmlValue)

}


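The problem is that the timeout is reached quite often, and that breaks the loop.

So if the timeout is reached, I would like to refresh the web page or retry the GET command, because I suspect the problem lies with the internet connection of the website I am trying to access.

The way my code works, a timeout breaks the loop. I need to either ignore the error and move on to the next iteration, or retry accessing the website.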
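
Both behaviours (retrying, then skipping) can be sketched with `tryCatch()`. The helper `with_retry` below is a hypothetical name, not an httr function, and the flaky function only stands in for a request that times out. httr also ships a `RETRY()` function that may be worth a look.

```r
# Call f() up to max_tries times; return NULL if every attempt fails.
# Note: an f() that legitimately returns NULL is treated as a failure.
with_retry <- function(f, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait)  # pause before the next try
  }
  NULL
}

# Stand-in for a flaky request: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "ok"
}
out <- with_retry(flaky)

# Inside the loop from the question, usage might look like:
#   h1 <- with_retry(function() GET(paste0("http:....", x[i]), timeout(10)))
#   if (is.null(h1)) next   # give up on this key and move on
```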

timeout get r httr

Fel*_*nga

2018-04-13

6
votes

2
answers

1338
views

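Calculate the distance between two airports (two columns) within R using an API?

I was wondering whether there is a way to compare airport distances (IATA codes). There are some scripts, but none of them use R. So I tried using an API:

developer.aero

Sample data: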

library(curl) # for curl post

departure <- c("DRS","TXL","STR","DUS","LEJ","FKB","LNZ")
arrival <- c("FKB","HER","BOJ","FUE","PMI","AYT","FUE")
flyID <- c(1,2,3,4,5,6,7)
df <- data.frame(departure,arrival,flyID)  

     departure arrival flyID
1       DRS     FKB     1
2       TXL     HER     2
3       STR     BOJ     3
4       DUS     FUE     4
5       LEJ     PMI     5
6       FKB     AYT     6
7       LNZ     FUE     7

api<- curl_fetch_memory("https://airport.api.aero/airport/distance/DRS/FUE?user_key=d805e84363494ca03b9b52d5a505c4d1")

cat(rawToChar(api$content))

callback({"processingDurationMillis":0,"authorisedAPI":true,"success":true,"airline":null,"errorMessage":null,"distance":"3,416.1","units":"km"})


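where DRS corresponds to the departure airport and FUE to the arrival airport.

So I thought of looping over df and pasting the codes into the URL. However, this seems rather difficult for an R newbie: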

df$distance<- list(length = nrow(df))
for (i in 1:nrow(df)){
  url <- paste0("https://airport.api.aero/airport/distance/", i, "FUE   ?user_key=d805e84363494ca03b9b52d5a505c4d1")
  myData[[i]] <- read.table(url, header=T,sep="|")
} …
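
A sketch of the two pieces this loop needs, under the assumption that every response has the same shape as the sample output shown earlier: building one URL per row from both columns, and pulling the numeric distance out of the `callback(...)` wrapper. `user_key` is a placeholder and `parse_distance` is a hypothetical helper.

```r
# First three rows of the sample data.
df <- data.frame(
  departure = c("DRS", "TXL", "STR"),
  arrival   = c("FKB", "HER", "BOJ"),
  stringsAsFactors = FALSE
)

user_key <- "YOUR_USER_KEY"  # placeholder, not a real key

# sprintf() is vectorised, so this yields one URL per row.
urls <- sprintf("https://airport.api.aero/airport/distance/%s/%s?user_key=%s",
                df$departure, df$arrival, user_key)

# Extract the "distance" field from a response body and convert it to a
# number, dropping the thousands separator ("3,416.1" becomes 3416.1).
parse_distance <- function(body) {
  m <- regmatches(body, regexpr('"distance":"[^"]*"', body))
  if (length(m) == 0) return(NA_real_)
  value <- sub('^"distance":"([^"]*)"$', "\\1", m)
  as.numeric(gsub(",", "", value))
}

# Sample response copied from the question.
sample_body <- paste0('callback({"processingDurationMillis":0,',
                      '"authorisedAPI":true,"success":true,"airline":null,',
                      '"errorMessage":null,"distance":"3,416.1","units":"km"})')
sample_distance <- parse_distance(sample_body)

# Fetching is left commented out because it needs network access and a key:
#   bodies <- vapply(urls, function(u)
#     rawToChar(curl::curl_fetch_memory(u)$content), character(1))
#   df$distance <- vapply(bodies, parse_distance, numeric(1))
```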


r rcurl httr

Mic*_*her

2016-06-02

6
votes

2
answers

1655
views

How to override an exported function from an R package listed in Imports

The DESCRIPTION file of my package lists httr in the Imports directive:

Imports:
    httr (>= 1.1.0),
    jsonlite,
    rstudioapi


httr exports the S3 method length.path.

S3method(length,path)


It is defined as:

#' @export
length.path <- function(x) file.info(x)$size


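In my package, I have some objects to which I assign the class "path". Every time I assign the class "path" to any object, regardless of whether I ever call length() on it, this is printed to stdout: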

Error in file.info(x) : invalid filename argument


Here is some reproducible code that anyone can run:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.1

> thing = …
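
The dispatch behind this can be reproduced without httr. The sketch below defines a local copy of the method quoted above (an assumption standing in for httr's registered method) and shows that `length()` on a non-character object of class "path" ends up in `file.info()`. It then shows one possible workaround, namely using a class name that no other package claims. The class name "mypkg_path" is hypothetical.

```r
# Local copy of the method shown in the question, standing in for httr's.
length.path <- function(x) file.info(x)$size

# A list-based object with class "path": length() dispatches to length.path,
# and file.info() rejects the non-character input.
thing <- structure(list(a = 1, b = 2), class = "path")
clash <- tryCatch(length(thing), error = function(e) conditionMessage(e))

# Possible workaround (assumption): a package-specific class name, so that
# no foreign S3 method is picked up and the default length() applies.
thing2 <- structure(list(a = 1, b = 2), class = "mypkg_path")
n <- length(thing2)
```

Here `clash` holds the error message raised inside `file.info()`, while `n` is the ordinary list length.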


namespaces r httr r-package

Nic*_*ite

lucky-day

6
votes

1
answer

446
views

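How to delete cookies retained by httr::GET?

httr::GET retains cookies when making requests to the same website.

Is it possible to query those retained cookies?
How can I flush those retained cookies and make a "pristine" request again?

Example: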

# Get login cookie
r1 <- GET("https://some.url/login", authenticate("foo", "bar"))

cookies(r1)
# returns a data frame of two cookies

# Make request that requires authentication cookie
# Only succeeds if r1 was made
r2 <- GET("https://some.url/data/?query&subset=1")
r2


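Note that when making r2 you do not have to pass any cookie information explicitly, because it is stored somewhere automatically.

I would like to know how to query or delete these stored cookies.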
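
httr keeps a pool of curl handles and reuses them for requests to the same host, which is where such cookies live between calls. The package exports `handle_reset()` for discarding the pooled handle of a given URL. A minimal sketch follows. The wrapper name `fresh_get` is hypothetical, and the example is not run here because it needs the httr package and network access.

```r
# Hypothetical wrapper: drop the pooled handle (and with it the cookies
# stored on that handle) for this URL's host, then make the request anew.
fresh_get <- function(url, ...) {
  httr::handle_reset(url)
  httr::GET(url, ...)
}

# Usage with the URLs from the question might look like:
#   r1 <- httr::GET("https://some.url/login", httr::authenticate("foo", "bar"))
#   httr::cookies(r1)   # cookies seen on that response
#   r2 <- fresh_get("https://some.url/data/?query&subset=1")
```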

r httr

Mic*_*hał

2017-06-16

6
votes

1
answer

1216
views

R rvest: could not find function "xpath_element"

I am simply trying to replicate the example from rvest::html_nodes(), but I run into an error:

library(rvest)
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
html_nodes(ateam, "center")


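Error in do.call(method, list(parsed_selector)) : could not find function "xpath_element"

The same thing happens if I load packages such as httr, xml2 and selectr. I also seem to have the latest versions of these packages...

Which package contains functions such as xpath_element and xpath_combinedselector? How do I get this to work? Note that I am running Ubuntu 16.04, so this code may work on other platforms...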

r httr rvest xml2 ubuntu-16.04

Mat*_*fou

lucky-day

6
votes

1
answer

1591
views

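Using an OAuth 2.0 token with the httr package in R

Problem

The httr package provides a cURL wrapper in R (see the package documentation).

I am new to HTTP and APIs. My trouble is getting OAuth 2.0 authentication to work. I have tried various syntax variations and get either errors or status 401.

What is the correct way to use an OAuth 2.0 token and make a GET() request with httr?

Code attempts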

# Set UP

  url = "https://canvas.{institution}.edu/api/v1/courses"
  key = "{secret_key}"

# 1
  GET(url, sign_oauth2.0(key)) 
  # Error: Deprecated: supply token object to config directly

# 2
  GET(url, config(sign_oauth2.0 = key)) 
  # unknown option: sign_oauth2.0

# 3
  GET(url, config = list(sign_oauth2.0 = key)) 
  # Status 401
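
If the key is an already-issued access token, as opposed to a client secret that still has to go through the OAuth dance, one common pattern is to send it in an `Authorization: Bearer` header using httr's `add_headers()`. This is a sketch under that assumption. The helper names `bearer_header` and `get_with_token` are hypothetical, and the request itself is not executed here. The error in attempt 1 also hints at the other route, which is passing a token object to `config(token = ...)`.

```r
# Build the Authorization header as a named character vector.
bearer_header <- function(token) {
  c(Authorization = paste("Bearer", token))
}

# Hypothetical helper: GET with the bearer token attached.
# Requires the httr package; it is defined here but not called.
get_with_token <- function(url, token) {
  httr::GET(url, httr::add_headers(.headers = bearer_header(token)))
}

# Usage with the placeholders from the question:
#   resp <- get_with_token(url, key)
#   httr::status_code(resp)
```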


curl r oauth oauth-2.0 httr

Dan*_*lle

lucky-day

6
votes

1
answer

3667
views