rvest web scraping error - identifying the correct css or xpath?

ell*_*iot 1 r web-scraping rvest

There's a database of Rwandan cooperatives with roughly 155 pages of data I'd like to access (without scrolling through the entire site). I'm having trouble identifying the correct xpath or css using the rvest package in R. I've also used the SelectorGadget tool to help identify the right nodes.

My problem is that I get a character(0) back, meaning I'm not scraping the right data.

library(rvest)

url <- 'http://www.rca.gov.rw/wemis/registration/all.php?start=0&status=approved'

html <- read_html(url)

rank_data_html <- html_nodes(html, css = '.primary td')

rank_data <- html_text(rank_data_html)

head(rank_data)

Is there a way to adapt this code to loop over the pages and scrape the data?

hrb*_*str 6

This has nothing to do with using the wrong selector. The site you're scraping does something pretty interesting on first visit:

[screenshot: the initial response is just a small javascript snippet that sets a cookie and reloads the page]

When you hit the page it sets a cookie and then refreshes the page (one of the sillier ways of forcing a "session" I've ever seen).

You can't see this happen in your browser developer tools' Network tab unless you capture the web requests with a proxy server. You can also see it by looking at what your initial read_html() call returns (it's just javascript plus a redirect).

Since the cookie is set via javascript, neither read_html() nor httr::GET() will help you directly.

But! All hope is not lost, and there's no need for a silly third-party requirement like Selenium or Splash (I'm surprised the resident experts haven't suggested those yet, since that seems to be the default response these days).

Let's get the cookie. Make sure this is a fresh, newly restarted R session, because libcurl (which curl uses, which in turn httr::GET() and read_html() ultimately use) maintains cookies. We'll rely on that feature to keep scraping pages, but if anything goes wrong you may need to start over with a fresh session.

library(xml2)
library(httr)
library(rvest)
library(janitor)

# Get access cookie

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

# The "page" that comes back is just javascript; pull the cookie value out of it
ckie <- httr::content(res, as = "text", encoding = "UTF-8")
ckie <- unlist(strsplit(ckie, "\r\n"))                         # split into lines
ckie <- grep("cookie", ckie, value = TRUE)                     # keep the cookie-setting line
ckie <- gsub("^document.cookie = '_accessKey2=|'$", "", ckie)  # strip the wrapper
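To make that extraction concrete, here's the same pipeline run on a made-up response body (the site's real javascript will differ in detail; this is illustration only):

```r
# Hypothetical response body, for illustration only
body <- "<script>\r\ndocument.cookie = '_accessKey2=abc123'\r\nlocation.reload()\r\n</script>"

ckie <- unlist(strsplit(body, "\r\n"))      # split into lines
ckie <- grep("cookie", ckie, value = TRUE)  # keep the cookie-setting line
ckie <- gsub("^document.cookie = '_accessKey2=|'$", "", ckie)

ckie
## [1] "abc123"
```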

Now, we'll set that cookie and pick up our PHP session cookie, both of which will persist from here on:

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  httr::set_cookies(`_accessKey2` = ckie),
  query = list(
    start = "0",
    status = "approved"
  )
) -> res
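To confirm both cookies are now in play, you can inspect the response's cookie jar (a quick sanity check; the exact cookie names beyond _accessKey2 are my assumption, not verified output):

```r
# Show the cookies libcurl is now holding for this host
httr::cookies(res)[, c("name", "value")]
```

You should see _accessKey2 plus a PHP session cookie; those are what let the later page requests through.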

Now, there are 400+ pages, so we'll cache the raw HTML in case you get something wrong and need to re-parse a page. That way you can iterate over the files instead of hitting the site again. To do that, we'll create a directory for them:

dir.create("rca-temp-scrape-dir")

Next, create the pagination start numbers:

pgs <- seq(0L, 8920L, 20)
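As a sanity check, that sequence yields one start offset per 20-row page:

```r
pgs <- seq(0L, 8920L, 20)
length(pgs)
## [1] 447
```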

And iterate over them. NOTE: I don't need all 400+ pages, so I only did 10. Remove the [1:10] to get them all. Also, unless you enjoy hurting other people, keep the sleep in there, since you aren't paying for the cpu/bandwidth and the site is likely pretty fragile.

lapply(pgs[1:10], function(pg) { 

  Sys.sleep(5) # Please don't hammer servers you don't pay for

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    query = list(
      start = pg,
      status = "approved"
    )
  ) -> res

  # YOU SHOULD USE httr FUNCTIONS TO CHECK FOR STATUS
  # SINCE THERE CAN BE HTTR ERRORS THAT YOU MAY NEED TO 
  # HANDLE TO AVOID CRASHING THE ITERATION

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  # THIS CACHES THE RAW HTML SO YOU CAN RE-SCRAPE IT FROM DISK IF NECESSARY  

  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", pg)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out) # makes better column names

}) -> recs
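One way to do the status checking that the capitalized comments call for (a sketch with a helper name I made up, not part of the original answer): wrap the request so a failed page returns NULL instead of crashing the loop.

```r
library(httr)

# Hypothetical helper: returns NULL on connection errors or non-200 responses
# so one bad page doesn't kill the whole iteration
safe_get <- function(url, ...) {
  res <- tryCatch(httr::GET(url, ...), error = function(e) NULL)
  if (is.null(res) || httr::status_code(res) != 200L) return(NULL)
  res
}
```

Inside the lapply() you'd call safe_get() in place of httr::GET(), return(NULL) when it fails, and drop the failures with recs <- Filter(Negate(is.null), recs) before binding.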

Finally, we bind those 10 data frames into one:

recs <- do.call(rbind.data.frame, recs)

str(recs)
## 'data.frame': 200 obs. of  9 variables:
##  $ s_no       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ code       : chr  "BUG0416" "RBV0494" "GAS0575" "RSZ0375" ...
##  $ name       : chr  "URUMURI RWA NGERUKA" "BADUKANA IBAKWE NYAKIRIBA" "UBUDASA COOPERATIVE" "KODUKB" ...
##  $ certificate: chr  "RCA/0734/2018" "RCA/0733/2018" "RCA/0732/2018" "RCA/0731/2018" ...
##  $ reg_date   : chr  "10.12.2018" "-" "10.12.2018" "07.12.2018" ...
##  $ province   : chr  "East" "West" "Mvk" "West" ...
##  $ district   : chr  "Bugesera" "Rubavu" "Gasabo" "Rusizi" ...
##  $ sector     : chr  "Ngeruka" "Nyakiliba" "Remera" "Bweyeye" ...
##  $ activity   : chr  "ubuhinzi (Ibigori, Ibishyimbo)" "ubuhinzi (Imboga)" "transformation (Amasabuni)" "ubworozi (Amafi)" ...
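Because the raw HTML was cached to disk, re-parsing later (say, with a different selector) never has to touch the site again. A sketch reusing the same parsing steps, assuming the scrape above has already populated the directory:

```r
# Re-parse the cached pages from disk instead of re-fetching
files <- list.files("rca-temp-scrape-dir", pattern = "\\.html$", full.names = TRUE)

recs <- lapply(files, function(f) {
  out <- xml2::read_html(f)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)
  janitor::clean_names(out)
})

recs <- do.call(rbind.data.frame, recs)
```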

If you're a tidyverse user, you can do it this way instead:

purrr::map_df(pgs[1:10], ~{

  Sys.sleep(5) 

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    httr::set_cookies(`_accessKey2` = ckie),
    query = list(
      start = .x,
      status = "approved"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", .x)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out)

}) -> recs

purrr::map_df() does the row-binding for you, which saves the lapply/do.call/rbind.data.frame machinations.