cap*_*oma 5 r web-scraping rvest
For a university research project, I would like to scrape the drug information published by the Swiss government at:
http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=
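The site does provide a robots.txt file. However, its content is freely available to the public, so I assume that scraping this data is not prohibited.
This is an update of this question, as I have made some progress.
What I have achieved so far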
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")

url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]

page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 1,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
Next: getting the basic data
# makes a table of all results of the first page
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill = TRUE) %>%
  bind_rows %>%
  tibble()
Next step: getting the additional data
# gives the desired information (= additional data) for the first drug (not yet very structured)
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text
My problem:
# if I open the second search page
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 2,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
Next: getting the new basic data
# this easily gives a table with the new results
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill = TRUE) %>%
  bind_rows %>%
  tibble()
However, if I try to get the new additional data, I get the results from page 1 again:
# does not give the desired output:
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text
Question:
Could it be that __VIEWSTATE changes with each new request_POST?
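One way to check this would be to read the hidden form fields from the most recent response rather than reusing the values stored in pgform, since ASP.NET WebForms pages typically send a fresh __VIEWSTATE and __EVENTVALIDATION with each response. The sketch below is only an illustration of that idea: hidden_fields is a hypothetical helper (not part of rvest), and sample_html is a made-up stand-in for a real response, so nothing here is taken from the actual site.

```r
library(rvest)

# Hypothetical helper (not an rvest function): collect every hidden <input>
# of a parsed page into a named list of name -> value.
hidden_fields <- function(doc) {
  inputs <- html_nodes(doc, xpath = '//input[@type="hidden"]')
  values <- html_attr(inputs, "value")
  values[is.na(values)] <- ""          # hidden inputs without a value attribute
  as.list(setNames(values, html_attr(inputs, "name")))
}

# Made-up stand-in for a server response; real field values are much longer.
sample_html <- '<html><body><form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form></body></html>'

fields <- hidden_fields(read_html(sample_html))
fields[["__VIEWSTATE"]]
```

Applied to the code above, the assumption would be that hidden_fields(read_html(page)) is called after each POST and that its values replace the pgform$fields$...$value entries in the body of the next request. Comparing the __VIEWSTATE from two consecutive responses would also show directly whether it changes.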