I'm new to web scraping and I want to get the data from this page: http://www.neotroptree.info/data/countrysearch
At that link there are four form fields (Country, Domain, State and Site).
I have a data frame with the site names, which I scraped with the following code:
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
ipak(c("rgdal", "tidyverse"))
#> Loading required package: rgdal
#> Loading required package: sp
#> rgdal: version: 1.3-4, (SVN revision 766)
#> Geospatial Data Abstraction Library extensions to R successfully loaded
#> Loaded GDAL runtime: GDAL 2.2.2, released 2017/09/15
#> Path to GDAL shared files: /usr/share/gdal/2.2
#> GDAL binary built with GEOS: TRUE
#> Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
#> Path to PROJ.4 shared files: (autodetected)
#> Linking to sp version: 1.3-1
#> Loading required package: tidyverse
#> rgdal tidyverse
#> TRUE TRUE
download.file(url = "http://www.neotroptree.info/files/Neotropicos.kmz",
              destfile = "neotroptree-site.kmz",
              quiet = FALSE)
rgdal::ogrListLayers("neotroptree-site.kmz")
#> [1] "Neotropicos"
#> [2] "Jubones, Quito, Pichincha, Ecuador"
#> attr(,"driver")
#> [1] "LIBKML"
#> attr(,"nlayers")
#> [1] 2
ntt <- rgdal::readOGR("neotroptree-site.kmz", "Neotropicos")
#> OGR data source with driver: LIBKML
#> Source: "/tmp/Rtmppf54qE/neotroptree-site.kmz", layer: "Neotropicos"
#> with 7504 features
#> It has 11 fields
ntt.df <- data.frame(site = ntt@data$Name,
                     long = ntt@coords[, 1],
                     lat = ntt@coords[, 2]) %>%
  .[order(.$site), ] %>%
  mutate(., ID = rownames(.)) %>%
  mutate(., site = as.character(site))
ntt.df <- ntt.df[, c("ID", "site", "long", "lat")]
glimpse(ntt.df)
#> Observations: 7,504
#> Variables: 4
#> $ ID <chr> "2618", "2612", "3229", "2717", "2634", "4907", "3940", "...
#> $ site <chr> "Abadia, cerrado", "Abadia, floresta semidecidual", "Abad...
#> $ long <dbl> -43.15000, -43.10667, -48.72250, -45.52493, -45.27417, -4...
#> $ lat <dbl> -17.690000, -17.676944, -16.089167, -19.111667, -19.26638...
Manually, I would need to enter each site name in the Site field, run the search, and download the resulting file, one site at a time.
My first attempt was with the rvest package, but I couldn't find the form fields on the page:
if(!require("rvest")) install.packages("rvest")
#> Loading required package: rvest
#> Loading required package: xml2
url <- "http://www.neotroptree.info/data/countrysearch"
webpage <- html_session(url)
webpage %>%
  html_form()
#> list()
Any ideas on how to iterate this process?
RSelenium, decapitated and splashr all introduce third-party dependencies that can be difficult to set up and maintain.
There's no browser instrumentation needed here, so RSelenium isn't called for; decapitated won't really help much either, and splashr is overkill for this use case.
The form you see on the site is a front end to a Solr database. Open your browser's developer tools, reload that URL and watch the XHR portion of the Network tab. You'll see it makes asynchronous requests on page load and on every form interaction.
All we have to do is mimic those interactions. The source below is heavily commented, and you may want to step through it manually to see what's going on under the hood.
We need some helpers:
library(xml2)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(tidyverse)
Most of the above get loaded anyway when you load rvest, but I like being explicit. Also, stringr is an unnecessary crutch compared to the more explicitly named stringi functions, so we'll use those.
First, we get the list of sites. This function mimics the POST request you'd see if you followed the suggestion to watch what's going on in the developer tools:
get_list_of_sites <- function() {
  # This is the POST request the site makes to get the metadata for the popups.
  # I used http://gitlab.com/hrbrmstr/curlconverter to untangle the monstrosity
  httr::POST(
    url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
    body = list(
      q = "*%3A*",
      host = "padme.rbge.org.uk",
      c = "neotroptree",
      template = "countries.tpl",
      datasetid = "",
      f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
    ),
    encode = "form"
  ) -> res
  httr::stop_for_status(res)
  # extract the returned JSON from the HTML document it returns
  xdat <- jsonlite::fromJSON(html_text(content(res, encoding="UTF-8")))
  # only return the site list (the xdat structure has a lot more in it tho)
  discard(xdat$facets$sitename_s, stri_detect_regex, "^[[:digit:]]+$")
}
We'll call it below; it just returns a character vector of site names.
Now we need a function to get the site data returned in the lower portion of the form output. It does the same thing as above, except it adds the ability to download the site file and to say where to store it. overwrite is handy since you may be doing lots of downloads and try to download the same file again. Because we use httr::write_disk() to save the file, setting this parameter to FALSE will throw an exception and stop any loop/iteration you have running. You probably don't want that (one way to guard a long run against errors is sketched after the function below).
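Since that list changes rarely, one option (my own addition, not part of the original answer) is to cache it locally so repeated runs don't keep hitting the Solr proxy; a minimal sketch, with the cache file name just an example:
# hypothetical helper: reuse a cached copy of the site list if one exists
get_sites_cached <- function(cache = "neotroptree-sites.rds") {
  if (file.exists(cache)) return(readRDS(cache))
  sites <- get_list_of_sites()
  saveRDS(sites, cache)
  sites
}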
get_site <- function(site, dl_path, overwrite=TRUE) {
  # this is the POST request the site makes as an XHR request so we just
  # mimic it with httr::POST. We pass in the site code in `q`
  httr::POST(
    url = "http://www.neotroptree.info/data/sys/scripts/solrform/solrproxy.php",
    body = list(
      q = sprintf('sitename_s:"%s"', curl::curl_escape(site)),
      host = "padme.rbge.org.uk",
      c = "neotroptree",
      template = "countries.tpl",
      datasetid = "",
      f = "facet.field%3Dcountry_s%26facet.field%3Dstate_s%26facet.field%3Ddomain_s%26facet.field%3Dsitename_s"
    ),
    encode = "form"
  ) -> res
  httr::stop_for_status(res)
  # it returns a JSON structure
  xdat <- httr::content(res, as="text", encoding="UTF-8")
  xdat <- jsonlite::fromJSON(xdat)
  # unfortunately the bit with the site-id is in HTML O_o
  # so we have to parse that bit out of the returned JSON
  site_meta <- xml2::read_html(xdat$docs)
  # now, extract the link code
  link <- html_attr(html_node(site_meta, "div.solrlink"), "data-linkparams")
  link <- stri_replace_first_regex(link, "code_s:", "")
  # Download the file and get the filename metadata back
  xret <- get_link(link, dl_path) # the code for this is below
  # add the site name
  xret$site <- site
  # return the list
  xret[c("code", "site", "path")]
}
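As noted above, an error from get_site() (a failed request, or an overwrite = FALSE collision) will stop a long run. One way to guard against that, sketched here rather than taken from the original answer, is to wrap it with purrr::safely() so errors are captured instead of thrown:
# capture errors instead of letting one bad site stop the whole run
safe_get_site <- purrr::safely(get_site)

res <- safe_get_site("Abadia, cerrado", dl_path = "/tmp")
if (is.null(res$error)) {
  res$result   # the usual list(code, site, path)
} else {
  message("Download failed: ", conditionMessage(res$error))
}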
I put the code that retrieves the file into a separate function since it seemed to make sense to encapsulate that functionality on its own. YMMV. I also took the liberty of removing the pointless commas from the file names.
get_link <- function(code, dl_path, overwrite=TRUE) {
  # The Download link looks like this:
  #
  # <a href="http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=AtlMG104">
  #   Download site details.
  # </a>
  #
  # So we can mimic that with httr
  site_tmpl <- "http://www.neotroptree.info/projectfiles/downloadsitedetails.php?siteid=%s"
  dl_url <- sprintf(site_tmpl, code)
  # The filename comes in a "Content-Disposition" header so we first
  # do a lightweight HEAD request to get the filename
  res <- httr::HEAD(dl_url)
  httr::stop_for_status(res)
  stri_replace_all_regex(
    res$headers["content-disposition"],
    '^attachment; filename="|"$', ""
  ) -> fil_name
  # commas in filenames are a bad idea rly
  fil_name <- stri_replace_all_fixed(fil_name, ",", "-")
  message("Saving ", code, " to ", file.path(dl_path, fil_name))
  # Then we use httr::write_disk() to do the saving in a full GET request
  res <- httr::GET(
    url = dl_url,
    httr::write_disk(
      path = file.path(dl_path, fil_name),
      overwrite = overwrite
    )
  )
  httr::stop_for_status(res)
  # return a list so we can make a data frame
  list(
    code = code,
    path = file.path(dl_path, fil_name)
  )
}
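The answer asks (twice, below) for a Sys.sleep(5) inside get_link() before any bulk run. One low-touch way to honor that without editing the function body, sketched here as my own suggestion, is to shadow it with a throttled version, so the existing get_site() picks up the pause automatically:
# keep the original around, then redefine get_link() with a pause up front
unthrottled_get_link <- get_link
get_link <- function(code, dl_path, overwrite = TRUE) {
  Sys.sleep(5)  # be gentle with the server between downloads
  unthrottled_get_link(code, dl_path, overwrite = overwrite)
}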
Now, we get the site list (as promised):
# get the site list
sites <- get_list_of_sites()
length(sites)
## [1] 7484
head(sites)
## [1] "Abadia, cerrado"
## [2] "Abadia, floresta semidecídua"
## [3] "Abadiânia, cerrado"
## [4] "Abaetetuba, Rio Urubueua, floresta inundável de maré"
## [5] "Abaeté, cerrado"
## [6] "Abaeté, floresta ripícola"
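Before looping over your own ntt.df$site values, it's worth checking how well they line up with what the Solr index calls the sites — note the output above shows "Abadia, floresta semidecídua" here versus "Abadia, floresta semidecidual" in your KMZ-derived data frame. A quick check (my own addition, not from the answer):
# how many of the KMZ site names match the Solr site names exactly?
matched   <- intersect(ntt.df$site, sites)
unmatched <- setdiff(ntt.df$site, sites)
length(matched)
head(unmatched)   # candidates for fuzzy matching or manual fixes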
We'll grab one site's ZIP file:
# get one site link dl
get_site(sites[1], "/tmp")
## $code
## [1] "CerMG044"
##
## $site
## [1] "Abadia, cerrado"
##
## $path
## [1] "/tmp/neotroptree-CerMG04426-09-2018.zip"
Now, grab a few more and return a data frame of code, site and saved path:
# get a few (remove [1:2] to do them all but PLEASE ADD A Sys.sleep(5) into get_link() if you do!)
map_df(sites[1:2], get_site, dl_path = "/tmp")
## # A tibble: 2 x 3
## code site path
## <chr> <chr> <chr>
## 1 CerMG044 Abadia, cerrado /tmp/neotroptree-CerMG04426-09-20…
## 2 AtlMG104 Abadia, floresta semidecídua /tmp/neotroptree-AtlMG10426-09-20…
If you're going to do a bulk download, please heed the guidance about adding a Sys.sleep(5) to get_link(). CPU, memory and bandwidth aren't free, and it's likely that site hasn't really scaled its servers to handle a sequence of roughly 8,000 back-to-back multi-HTTP-request calls, each with a file download at the end.
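Putting the pieces together, a hedged sketch of what a polite bulk run might look like — it assumes the Sys.sleep(5) throttle has been added to get_link() as requested, and reuses the safely() idea sketched earlier so one failure doesn't abort roughly 7,500 downloads:
# assumes get_link() already pauses between downloads (see above)
safe_get_site <- purrr::safely(get_site)

results <- purrr::map(sites, safe_get_site, dl_path = "/tmp")

# successful downloads as a data frame of code / site / path
downloads <- purrr::map_dfr(results, "result")

# anything that failed, kept for inspection and a later retry
failures <- purrr::compact(purrr::map(results, "error"))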