Polite web scraping with rvest in R

Question

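I have some code that scrapes a website, but after running the scrape a number of times I get a 403 Forbidden error. I understand there is an R package called polite that works out how to run a scrape according to the host's requirements, so that the 403 does not occur. I tried my best to adapt it to my code, but I'm stuck. I would really appreciate some help. Here is some reproducible example code with just a few of the links: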

library(tidyverse)
library(httr)
library(rvest)
library(curl)

urls <- c(
  "https://www.pro-football-reference.com/teams/pit/2021.htm",
  "https://www.pro-football-reference.com/teams/pit/2020.htm",
  "https://www.pro-football-reference.com/teams/pit/2019.htm"
)

pitt <- map_dfr(
  .x = urls,
  .f = function(x) {
    Sys.sleep(2) # pause between requests
    cat(1)       # progress marker
    read_html(curl(x, handle = curl::new_handle("useragent" = "chrome"))) %>%
      html_nodes("table") %>%
      html_table(header = TRUE) %>%
      simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names(.) %>%
      select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      mutate(year = str_extract(string = x, pattern = "\\d{4}"))
  }
)

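This run should work without problems, but the full run covers every year from 1933 to 2021, not just the three years linked in the example. I'm open to any approach, whether it uses the polite package or some other method that experts may be more familiar with, to solve this responsibly.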

Answer 1

小智 5

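Here is my suggestion for how to use polite in this case. The code builds a grid of teams and seasons and scrapes the data politely.

The parser is taken from your example.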

library(magrittr)

# Create polite session
host <- "https://www.pro-football-reference.com/"
session <- polite::bow(host, force = TRUE)

# Create grid of teams and seasons that shall be scraped
seasons <- 2020:2021
teams <- c("pit", "nor")
grid_to_scrape <- tidyr::expand_grid(team = teams, season = seasons)
grid_to_scrape
#> # A tibble: 4 × 2
#>   team  season
#>   <chr>  <int>
#> 1 pit     2020
#> 2 pit     2021
#> 3 nor     2020
#> 4 nor     2021

responses <- purrr::pmap_dfr(grid_to_scrape, function(team, season, session){
  # For some verbose status updates
  cli::cli_process_start("Scrape {.val {team}}, {.val {season}}")
  # Create full url and scrape
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  # Parse the response, suppress Janitor warnings. This is a problem of the parser
  suppressWarnings({
    response <- scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
  # Update status
  cli::cli_process_done()
  # return parsed data
  response
}, session = session)
#> ℹ Scrape "pit", 2020
#> ✓ Scrape "pit", 2020 ... done
#> 
#> ℹ Scrape "pit", 2021
#> ✓ Scrape "pit", 2021 ... done
#> 
#> ℹ Scrape "nor", 2020
#> ✓ Scrape "nor", 2020 ... done
#> 
#> ℹ Scrape "nor", 2021
#> ✓ Scrape "nor", 2021 ... done
#> 

responses
#> # A tibble: 77 × 10
#>    week  day   date       result record opponent team_score opponent_score  year
#>    <chr> <chr> <chr>      <chr>  <chr>  <chr>    <chr>      <chr>          <int>
#>  1 1     "Mon" "Septembe… "boxs… "1-0"  New Yor… "26"       "16"            2020
#>  2 2     "Sun" "Septembe… "boxs… "2-0"  Denver … "26"       "21"            2020
#>  3 3     "Sun" "Septembe… "boxs… "3-0"  Houston… "28"       "21"            2020
#>  4 4     ""    ""         ""     ""     Bye Week ""         ""              2020
#>  5 5     "Sun" "October … "boxs… "4-0"  Philade… "38"       "29"            2020
#>  6 6     "Sun" "October … "boxs… "5-0"  Clevela… "38"       "7"             2020
#>  7 7     "Sun" "October … "boxs… "6-0"  Tenness… "27"       "24"            2020
#>  8 8     "Sun" "November… "boxs… "7-0"  Baltimo… "28"       "24"            2020
#>  9 9     "Sun" "November… "boxs… "8-0"  Dallas … "24"       "19"            2020
#> 10 10    "Sun" "November… "boxs… "9-0"  Cincinn… "36"       "10"            2020
#> # … with 67 more rows, and 1 more variable: team <chr>

Created on 2022-02-22 by the reprex package (v2.0.1)
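
For the full 1933 to 2021 run mentioned in the question, the same pattern could probably be extended along the following lines. This is only a sketch and not part of the original answer: the helper name scrape_tables, the user-agent string, and the 10-second delay are assumptions chosen for illustration, and the site's actual rate limits may call for a different value. purrr::possibly() is used so that one failing page yields NULL instead of aborting the remaining requests.

```r
library(magrittr)

# Grid for the full run: one team, every season from 1933 to 2021.
grid_full <- tidyr::expand_grid(team = "pit", season = 1933:2021)

# Hypothetical helper (not from the original answer): fetch one team-season
# page through the polite session and return all tables found on it.
scrape_tables <- function(team, season, session) {
  page <- polite::nod(session, glue::glue("teams/{team}/{season}.htm")) %>%
    polite::scrape()
  rvest::html_table(rvest::html_elements(page, "table"), header = TRUE)
}

# Wrap the helper so an error on one page returns NULL instead of stopping.
safe_scrape <- purrr::possibly(scrape_tables, otherwise = NULL)

# The requests themselves only run in an interactive session.
if (interactive()) {
  session <- polite::bow(
    "https://www.pro-football-reference.com/",
    user_agent = "polite R scraper (contact: you@example.com)", # placeholder
    delay = 10 # assumed pause in seconds between requests
  )
  results <- purrr::pmap(grid_full, safe_scrape, session = session)
}
```

Entries of results that come back as NULL mark seasons that failed and can be retried later; the parser from the answer above can then be applied to the remaining list elements.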
