R：POST 后抓取附加数据仅适用于第一页

我想从以下位置抓取瑞士政府为大学研究项目提供的药物信息：

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

该页面确实提供了一个robotx.txt 文件，但是，它的内容对公众免费提供，我认为抓取这些数据是不受禁止的。

这是这个问题的更新，因为我取得了一些进展。

到目前为止我取得的成就

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Run Code Online (Sandbox Code Playgroud)

下一篇：获取基础数据

# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

Run Code Online (Sandbox Code Playgroud)

下一步：获取附加数据

# gives the desired informations (=additional data) of the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

Run Code Online (Sandbox Code Playgroud)

我的问题：

# if I open the second  search page

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Run Code Online (Sandbox Code Playgroud)

下一篇：获取新的基础数据

# I get easily a table with the new results

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

Run Code Online (Sandbox Code Playgroud)

但是，如果我尝试获取新的附加数据，则会再次从第 1 页获得结果：

# does not give the desired output:

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

Run Code Online (Sandbox Code Playgroud)

我要找的：第2页第一种药的详细资料

问题：

为什么我得到重复的结果？是因为__VIEWSTATE在新的期间可能会发生变化request_POST吗？
有没有办法解决这个问题？
有没有更好的方法来获取基本数据和附加数据？如果是，如何？

归档时间：	6 年，8 月前
查看次数：	277 次
最近记录：	6 年，8 月前