How to submit a login form without a button argument in the rvest package

and*_*ndy 12 forms r web-scraping rvest

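I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I have not been able to adapt it to my case.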

united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
         html_nodes("form") %>%
         extract2(1) %>%
         html_form() %>%
         set_values(
                `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
                `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>% 
submit_form(login, "ctl00$ContentInfo$SignInSecure")

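In my case, I can't find the field names to set in the form, so I tried passing the user name and password directly: set_values("email","password")

I also don't know how to refer to the submit button, so I tried: submit_form(account, login)

The error I get from the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds

Any ideas on how to go about this are appreciated. Thank you.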

Mic*_*ths 11

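Currently, this problem is the same as open issue #159 in the rvest package, which causes trouble whenever not all fields in a form have a type value. This bug may be fixed in a future release.

However, we can work around it by monkey patching the underlying function rvest:::submit_request.

The core problem is the helper function is_submit. Originally, it is defined as follows: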

is_submit <- function(x) tolower(x$type) %in% c("submit", 
        "image", "button")

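As logical as this is, it fails in two scenarios:

  1. There is no type element.
  2. The type element is NULL.

Both of these happen to occur on the United login form. We can solve this by adding two checks inside the function.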
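To see why the original helper misbehaves, here is a minimal base-R sketch. The `fields` list is a hypothetical stand-in for `form$fields` (the field names `email`, `password`, and `login` are made up for illustration), and `is_submit_patched` is only an illustrative name for the guarded version.

```r
# Hypothetical stand-in for form$fields: the password field has no type.
fields <- list(
  email    = list(name = "email",    type = "text",   value = ""),
  password = list(name = "password",                  value = ""),
  login    = list(name = "login",    type = "submit", value = "Sign in")
)

# Original helper: yields logical(0) instead of FALSE when type is missing.
is_submit_orig <- function(x) tolower(x$type) %in% c("submit", "image", "button")

# Guarded helper: a missing or NULL type is treated as "not a submit field".
is_submit_patched <- function(x) {
  if (is.null(x$type)) return(FALSE)
  tolower(x$type) %in% c("submit", "image", "button")
}

length(is_submit_orig(fields$password))   # 0: neither TRUE nor FALSE
is_submit_patched(fields$password)        # FALSE

# Filter() unlists the per-field results, so zero-length results are dropped
# and the remaining TRUE/FALSE values no longer line up with the fields.
names(Filter(is_submit_orig, fields))     # misaligned: not the submit field
names(Filter(is_submit_patched, fields))  # "login"

# When no field has a type at all, nothing is selected, and
# names(submits)[[1]] fails with "subscript out of bounds".
untyped <- lapply(fields, function(f) f[c("name", "value")])
submits <- Filter(is_submit_orig, untyped)
result <- tryCatch(names(submits)[[1]], error = function(e) conditionMessage(e))
print(result)
```

The last step reproduces the error message from the question without touching the network, which suggests the failure comes from the helper rather than from the site itself.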

custom.submit_request <- function (form, submit = NULL) 
{
  is_submit <- function(x) {
    # A field with a missing or NULL type cannot be a submit target
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  } 
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ", 
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET", 
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- pluck(fields, "value")
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}

To apply the patch, we need the R.utils package (install it with install.packages("R.utils") if you don't have it).

library(R.utils)

reassignInPackage("submit_request", "rvest", custom.submit_request)

From there, we can issue our own request.

account <- account %>% 
     submit_form(login, "ctl00$ContentInfo$SignInSecure")

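This works!

(Well, "works" is a misnomer. Because United has adopted more aggressive authentication requirements, including known browsers, this results in a 301 Unauthorized. However, it does fix the error.)

The complete reproducible example involves a few other minor code changes: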

library(magrittr)
library(rvest)

url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>% 
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
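On the question of which names to pass to set_values() and submit_form(): one way to find them is to parse the form and list its field names. The sketch below uses an inline, hypothetical login page (the field names `email`, `password`, and `go` are made up), and it assumes that `form$fields` is a named list, which is how rvest has represented forms in the versions I am aware of.

```r
library(rvest)

# Hypothetical login page; the field names are made up for illustration.
page <- read_html('
  <form action="/login" method="post">
    <input type="text" name="email">
    <input type="password" name="password">
    <input name="go" value="Sign in">
  </form>')

# html_form() on a whole document returns a list of forms; take the first.
form <- html_form(page)[[1]]

# The names of form$fields are the names that set_values() and
# submit_form() expect (assumption: fields is a named list).
field_names <- names(form$fields)
print(field_names)
```

Note that the `go` input here has no type attribute, which is the situation that trips up the unpatched is_submit helper.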