bri*_*enb 6 post r web-scraping httr rvest
我正在尝试以编程方式搜索网站,但提交按钮功能似乎主要由JavaScript提供支持.我并不过分熟悉它是如何工作的,所以我可能错了.
这是我正在使用的代码:
library(rvest)
BASE_URL = 'https://mdocweb.state.mi.us/otis2/otis2.aspx'
PARAMS = list(txtboxLName='Smith',
drpdwnGender='Either',
drpdwnRace='All',
drpdwnStatus='All',
submit='btnSearch')
# rvest approach
s = html_session(BASE_URL)
form = html_form(s)[[1]]
form = set_values(form, PARAMS)
resp = submit_form(s, form, submit='btnSearch') # This gives an error
# httr approach
resp = httr::POST(BASE_URL, body=PARAMS, encode='form')
html = httr::content(resp) # This just returns that same page I was on
Run Code Online (Sandbox Code Playgroud)
该按钮的HTML如下所示:
<input type="submit" name="btnSearch" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("btnSearch", "", true, "", "", false, false))" language="javascript" id="btnSearch" style="width:100px;">
Run Code Online (Sandbox Code Playgroud)
鉴于该onclick属性,我没有受过教育的假设是JavaScript的使用干扰了我的方法.但同样,我并不完全理解这一切是如何运作的,所以我可能错了.
无论哪种方式,我如何实现我的目标,如果有的话,使用rvest或httr,但不是RSelenium?此外,如果这在Python中可以实现,我也会接受.
我们首先需要获取原始搜索页面,因为这是一个sharepoint站点(或者像一个一样),我们需要一些隐藏的表单字段,以便稍后使用:
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
str(hidden)
## Named chr [1:3] "x62pLbphYWUDXsdoNdBBNrxqyHHI+K06BzjFwdP3Uooafgey2uG1gLWxzh07djRxiQR724uplZFAI8klbq6HCSkmrp8jP15EMwvkDM/biUEuQrf"| __truncated__ ...
## - attr(*, "names")= chr [1:3] "__VIEWSTATE" "__VIEWSTATEGENERATOR" "__EVENTVALIDATION"
Run Code Online (Sandbox Code Playgroud)
现在,我们需要像表单一样使用HTTP POST来提交它:
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = "Smith",
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
Run Code Online (Sandbox Code Playgroud)
我们将在一分钟内需要这个辅助函数:
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
Run Code Online (Sandbox Code Playgroud)
现在,我们需要结果页面中的HTML:
pg <- content(res, as="parsed")
Run Code Online (Sandbox Code Playgroud)
不幸的是,"表"实际上是一组<div>s.但是,它以编程方式生成并且非常统一.我们不想打字太多,所以让我们先得到我们稍后会使用的列名:
col_names <- html_nodes(pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
## [1] "offender_number" "last_name" "first_name"
## [4] "date_of_birth" "sex" "race"
## [7] "mcl_number" "location" "status"
## [10] "parole_board_jurisdiction_date" "maximum_date" "date_paroled"
Run Code Online (Sandbox Code Playgroud)
该网站非常好,因为它通过提供屏幕阅读器提示来容纳残疾人.不幸的是,这会让人感到不知所措,因为我们要么必须使用值来定位标签,要么稍后清理文本.值得庆幸的是,xml2 现在能够删除节点:
xml_find_all(pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(pg, ".//span[@class='visible-phone']") %>% xml_remove()
Run Code Online (Sandbox Code Playgroud)
我们现在可以收集所有罪犯记录<div>"行":
records <- html_nodes(pg, "div.offenderRow")
Run Code Online (Sandbox Code Playgroud)
并且,简洁地将它们放入数据框:
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols() %>%
readr::type_convert() -> xdf
xdf
## # A tibble: 25 x 12
## offender_number last_name first_name date_of_birth sex race mcl_number location status
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 544429 SMITH AARICK 12/03/1967 M White 333.74012D3 Gladwin Parole
## 2 210262 SMITH AARON 05/27/1972 M Black <NA> <NA> Dischrg
## 3 372965 SMITH AARON 09/16/1973 M White <NA> <NA> Dischrg
## 4 413411 SMITH AARON 07/13/1973 M Black <NA> <NA> Dischrg
## 5 618210 SMITH AARON 10/12/1984 M Black <NA> <NA> Dischrg
## 6 675823 SMITH AARON 05/19/1989 M Black 333.74032A5 Det Lahser Prob Prob
## 7 759548 SMITH AARON 06/19/1990 M Black <NA> <NA> Dischrg
## 8 763189 SMITH AARON 07/15/1976 M White 333.74032A5 Mt. Pleasant Prob
## 9 854557 SMITH AARON 12/27/1973 M White <NA> <NA> Dischrg
## 10 856804 SMITH AARON 02/24/1989 M White 750.110A2 Harrison CF Prison
## # ... with 15 more rows, and 3 more variables: parole_board_jurisdiction_date <chr>, maximum_date <chr>,
## # date_paroled <chr>
glimpse(xdf)
## Observations: 25
## Variables: 12
## $ offender_number <int> 544429, 210262, 372965, 413411, 618210, 675823, 759548, 763189, 854557, 85...
## $ last_name <chr> "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "S...
## $ first_name <chr> "AARICK", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "...
## $ date_of_birth <chr> "12/03/1967", "05/27/1972", "09/16/1973", "07/13/1973", "10/12/1984", "05/...
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
## $ race <chr> "White", "Black", "White", "Black", "Black", "Black", "Black", "White", "W...
## $ mcl_number <chr> "333.74012D3", NA, NA, NA, NA, "333.74032A5", NA, "333.74032A5", NA, "750....
## $ location <chr> "Gladwin", NA, NA, NA, NA, "Det Lahser Prob", NA, "Mt. Pleasant", NA, "Har...
## $ status <chr> "Parole", "Dischrg", "Dischrg", "Dischrg", "Dischrg", "Prob", "Dischrg", "...
## $ parole_board_jurisdiction_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "11/28/2024", "03/25/2016", NA, NA, NA...
## $ maximum_date <chr> NA, "09/03/2015", "06/29/2016", "10/02/2017", "05/19/2017", "07/18/2019", ...
## $ date_paroled <chr> "11/15/2016", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Run Code Online (Sandbox Code Playgroud)
我曾希望type_convert对日期列(S)WLD提供更好的变换,尤其但它没有,可以很可能被淘汰.
现在,您需要对结果页面进行更多的工作,因为结果是分页的.谢天谢地,您知道页面信息:
xml_integer(html_nodes(pg, "span#lblPgCurrent"))
## [1] 1
xml_integer(html_nodes(pg, "span#lblTotalPgs"))
## [1] 101
Run Code Online (Sandbox Code Playgroud)
你将不得不再次做"隐藏的"舞蹈:
html_nodes(pg, "input[type='hidden']")
Run Code Online (Sandbox Code Playgroud)
(按照上面的参考索引来解决该问题)并重新POST调用一个只有那些隐藏字段和另外一个表单元素的新调用:btnNext = 'Next'.您需要在分页结果集中的所有单个页面上重复此操作,然后最后bind_rows()一切.
我想补充一点,当你弄清楚分页工作流程时,首先要抓住一个新的空白搜索页面.SharePoint服务器似乎有一个非常小的视图状态会话高速缓存超时和代码来配置,如果等待时间过长,迭代之间将打破.
UPDATE
我有点想确保最后一点建议有效,所以这就是:
library(httr)
library(rvest)
library(tidyverse)
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
start_search <- function(last_name) {
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = last_name,
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
content(res, as="parsed")
}
extract_results <- function(results_pg) {
col_names <- html_nodes(results_pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
xml_find_all(results_pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(results_pg, ".//span[@class='visible-phone']") %>% xml_remove()
records <- html_nodes(results_pg, "div.offenderRow")
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols()
}
current_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblPgCurrent"))
}
last_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblTotalPgs"))
}
scrape_status <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
message(sprintf("%s of %s", cur, tot))
}
next_page <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
if (cur == tot) return(NULL)
setNames(
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = hidden["__EVENTTARGET"],
`__EVENTARGUMENT` = hidden["__EVENTARGUMENT"],
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
btnNext = 'Next'
),
encode = "form"
) -> res
content(res, as="parsed")
}
curr_pg <- start_search("smith")
results_df <- extract_results(curr_pg)
pb <- progress_estimated(last_page_number(curr_pg)-1)
repeat{
scrape_status(curr_pg) # optional esp since we have a progress bar
pb$tick()$print()
curr_pg <- next_page(curr_pg)
if (is.null(curr_pg)) break
results_df <- bind_rows(results_df, extract_results(next_pg))
Sys.sleep(5) # be kind
}
Run Code Online (Sandbox Code Playgroud)
希望你能跟进,但是shd会在给定的搜索词中为你获取所有页面.