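I want to download all the data from this website, as PDF or Excel, for every combination of State x Crop Year x Standard Report.
I followed this tutorial to do that: Download data from URL
However, I get an error on the second line.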
driver <- rsDriver()
Error in subprocess::spawn_process(tfile, ...) :
group termination: could not assign process to a job: Access is denied
Is there another way I can download this data?
First, check whether the website has a robots.txt. Then read the terms and conditions, if any. It is also always important to throttle your requests.
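As a rough illustration of the robots.txt check, here is a minimal sketch. The helper names `disallowed_paths` and `fetch_robots` are hypothetical (they are not part of httr), and the parser is deliberately simplified: it only collects `Disallow` rules listed under `User-agent: *` and ignores `Allow` rules, wildcards, and `Crawl-delay`.

```r
# Hypothetical helper: collect the Disallow paths that apply to all crawlers ("*").
# Simplified on purpose; a real robots.txt parser handles more directives.
disallowed_paths <- function(robots_txt) {
  lines <- trimws(strsplit(robots_txt, "\r?\n")[[1]])
  lines <- sub("\\s*#.*$", "", lines)  # drop trailing comments
  in_star <- FALSE
  out <- character(0)
  for (ln in lines) {
    if (grepl("^user-agent\\s*:", ln, ignore.case = TRUE)) {
      # A new User-agent line starts a new group
      in_star <- identical(trimws(sub("^[^:]*:", "", ln)), "*")
    } else if (in_star && grepl("^disallow\\s*:", ln, ignore.case = TRUE)) {
      path <- trimws(sub("^[^:]*:", "", ln))
      if (nzchar(path)) out <- c(out, path)  # an empty Disallow means "allow everything"
    }
  }
  out
}

# Hypothetical helper: fetch robots.txt from the site root; "" if it is not there.
fetch_robots <- function(base_url) {
  r <- httr::GET(httr::modify_url(base_url, path = "robots.txt"))
  if (httr::status_code(r) != 200) return("")
  httr::content(r, "text", encoding = "UTF-8")
}

# Offline example with a made-up robots.txt
example_txt <- "User-agent: *\nDisallow: /private/\n\nUser-agent: bot\nDisallow: /x"
print(disallowed_paths(example_txt))
```

Against the real site this would be called as `disallowed_paths(fetch_robots("https://aps.dac.gov.in/LUS/Public/Reports.aspx"))`. If the report path shows up in the result, the site is asking crawlers to stay away from it.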
Once you have checked all the terms and conditions, the code below should get you started:
library(httr)
library(xml2)

link <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
r <- GET(link)
doc <- read_html(content(r, "text"))
#write_html(doc, "temp.html")

# Dropdown options as named vectors: name = visible label, value = form value
states <- sapply(xml_find_all(doc, ".//select[@name='DdlState']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))
states <- states[!grepl("^Select", names(states))]
years <- sapply(xml_find_all(doc, ".//select[@name='DdlYear']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))
years <- years[!grepl("^Select", names(years))]
rptfmt <- sapply(xml_find_all(doc, ".//select[@name='DdlFormat']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))

# Standard report links in the tree view: name = report title, value = node id
stdrpts <- unlist(lapply(xml_find_all(doc, ".//td/a"), function(x) {
    id <- xml_attr(x, "id")
    if (grepl("^TreeView1t", id)) return(setNames(id, xml_text(x)))
}))

# Collect all hidden form fields, which must be sent back with every POST
get_vs <- function(doc) sapply(xml_find_all(doc, ".//input[@type='hidden']"), function(x)
    setNames(xml_attr(x, "value"), xml_attr(x, "name")))

fmt <- rptfmt[2] #Excel format
for (sn in names(states)) {
    for (yn in names(years)) {
        for (srn in seq_along(stdrpts)) {
            s <- states[sn]
            y <- years[yn]
            sr <- stdrpts[srn]

            # Step 1: select the state, as if the dropdown had been changed
            r <- POST(link,
                body=as.list(c("__EVENTTARGET"="DdlState",
                    "__EVENTARGUMENT"="",
                    "__LASTFOCUS"="",
                    "TreeView1_ExpandState"="ennnn",
                    "TreeView1_SelectedNode"="",
                    "TreeView1_PopulateLog"="",
                    get_vs(doc),
                    DdlState=unname(s),
                    DdlYear=0,
                    DdlFormat=1)),
                encode="form")
            doc <- read_html(content(r, "text"))

            # Step 2: click the report in the tree view with state, year and format set
            treeview <- c("__EVENTTARGET"="TreeView1",
                "__EVENTARGUMENT"=paste0("sStandard Reports\\", srn),
                "__LASTFOCUS"="",
                "TreeView1_ExpandState"="ennnn",
                "TreeView1_SelectedNode"=unname(stdrpts[srn]),
                "TreeView1_PopulateLog"="")
            vs <- get_vs(doc)
            ddl <- c(DdlState=unname(s), DdlYear=unname(y), DdlFormat=unname(fmt))
            r <- POST(link, body=as.list(c(treeview, vs, ddl)), encode="form")

            # Save only if an Excel file came back; identical() also copes with a missing header
            if (identical(r$headers$`content-type`, "application/vnd.ms-excel"))
                writeBin(content(r, "raw"), paste0(sn, "_", yn, "_", names(stdrpts)[srn], ".xls"))

            Sys.sleep(5) # throttle requests
        }
    }
}
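The key step in the code above is that every POST sends back the hidden form fields taken from the most recent response. ASP.NET Web Forms pages typically keep their state in hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION`. The sketch below shows the same idea offline on a made-up HTML fragment; the field values and `DdlState = "1"` are placeholders, not values from the real page.

```r
library(xml2)

# Made-up form for illustration; the real page has far longer field values
html <- '<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>'
doc <- read_html(html)

# Vectorised equivalent of get_vs(): named vector of hidden field values
hidden <- xml_find_all(doc, ".//input[@type='hidden']")
fields <- setNames(xml_attr(hidden, "value"), xml_attr(hidden, "name"))

# Body of a postback: event target, then the echoed hidden fields, then form values
body <- as.list(c("__EVENTTARGET" = "DdlState", fields, DdlState = "1"))
str(body)
```

Only the hidden inputs are picked up, so the visible text field `q` does not end up in `body`. In the real loop the fields are re-read after each response because the server may issue new values on every postback.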