Kru*_*rug 5 html for-loop r lapply web-scraping
This code downloads, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose "Release" column contains "Employment Situation".
library(rvest)
pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")
# target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")
# we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
# clean up the cruft and make our dates!
nfpdates2015 <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
###thanks @hrbrmstr for this###
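I would like to repeat this for the URLs of other years, which are named the same way with only the year number changing. In particular, for the following URLs: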
#From 2008 to 2015
http://www.bls.gov/schedule/news_release/2015_sched.htm
http://www.bls.gov/schedule/news_release/2014_sched.htm
...
http://www.bls.gov/schedule/news_release/2008_sched.htm
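My knowledge of rvest, HTML, and XML is almost non-existent. I thought of applying the same code with a for loop, but my attempts were futile. Of course I could repeat the 2015 code eight times to cover all the years; it would neither take too long nor take up too much space. But I would very much like to know how to do this in a more efficient way. Thank you.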
In a loop, you can build the url string for each year with a paste0 statement:
for(i in 2008:2015){
  url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
  pg <- read_html(url)
  ## all your other code goes here.
}
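As written, the loop overwrites pg on every pass, so the dates for each year have to be stored somewhere. A minimal sketch of one way to do that, assuming the question's scraping steps are wrapped in a helper: the names get_dates, collect_dates, and the fetch parameter are hypothetical and not part of the original answer.

```r
# Hypothetical helper: the question's scraping steps wrapped in a function.
# It is only defined here; calling it needs rvest and network access.
get_dates <- function(url) {
  pg <- rvest::read_html(url)
  body <- rvest::html_nodes(pg, "div#bodytext")
  es_nodes <- rvest::html_nodes(
    body,
    xpath = ".//td[contains(., 'Employment Situation for')]/../td[1]"
  )
  as.Date(trimws(rvest::html_text(es_nodes)), format = "%A, %B %d, %Y")
}

# Hypothetical wrapper around the for loop. `fetch` is a parameter so the
# loop can be tried with a stand-in function before hitting the network.
collect_dates <- function(years, fetch = get_dates) {
  results <- vector("list", length(years))  # preallocate one slot per year
  names(results) <- years                   # names become "2008", "2009", ...
  for (k in seq_along(years)) {
    url <- paste0("http://www.bls.gov/schedule/news_release/", years[k], "_sched.htm")
    results[[k]] <- fetch(url)
  }
  results
}

# nfpdates <- collect_dates(2008:2015)   # real run: needs rvest and network access
```

With the results named by year, a single year can then be picked out with something like nfpdates[["2015"]].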
Or use lapply to return a list of results.
lst <- lapply(2008:2015, function(x){
  url <- paste0("http://www.bls.gov/schedule/news_release/", x, "_sched.htm")
  ## all your other code goes here.
  pg <- read_html(url)
  # target only the <td> elements under the bodytext div
  body <- html_nodes(pg, "div#bodytext")
  # we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
  es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
  # clean up the cruft and make our dates!
  nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
  return(nfpdates)
})
Which returns
lst
[[1]]
[1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01" "2008-09-05"
[10] "2008-10-03" "2008-11-07" "2008-12-05"
[[2]]
[1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07" "2009-09-04"
[10] "2009-10-02" "2009-11-06" "2009-12-04"
## etc...
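The list elements are unnamed, so it may help to label them by year, and possibly to flatten them into one vector. A small sketch of both, using a stand-in list with only the first two dates of 2008 and 2009 from the output above so that it runs offline (the variable all_dates is just an illustrative name):

```r
# Stand-in for the list returned by lapply(); only two dates per year shown.
lst <- list(
  as.Date(c("2008-01-04", "2008-02-01")),
  as.Date(c("2009-01-09", "2009-02-06"))
)

# Label each element with its year so it can be looked up by name.
names(lst) <- 2008:2009
lst[["2009"]]

# Flatten into a single Date vector. do.call(c, ...) keeps the Date class,
# whereas unlist() would return the underlying day counts as plain numbers.
all_dates <- do.call(c, unname(lst))
class(all_dates)
```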
This can be done with sprintf (no loop needed):
url <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
url
#[1] "http://www.bls.gov/schedule/news_release/2008_sched.htm" "http://www.bls.gov/schedule/news_release/2009_sched.htm"
#[3] "http://www.bls.gov/schedule/news_release/2010_sched.htm" "http://www.bls.gov/schedule/news_release/2011_sched.htm"
#[5] "http://www.bls.gov/schedule/news_release/2012_sched.htm" "http://www.bls.gov/schedule/news_release/2013_sched.htm"
#[7] "http://www.bls.gov/schedule/news_release/2014_sched.htm" "http://www.bls.gov/schedule/news_release/2015_sched.htm"
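No loop is needed because sprintf is vectorized over its arguments. The same holds for paste0, so either function can build all eight URLs in one call. A quick check of that claim (the variable names here are illustrative only):

```r
years <- 2008:2015

# One call each; both functions recycle the fixed strings across `years`.
urls_sprintf <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", years)
urls_paste0  <- paste0("http://www.bls.gov/schedule/news_release/", years, "_sched.htm")

# Compare the two character vectors element by element.
identical(urls_sprintf, urls_paste0)
```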
If we need to read the links:
library(rvest)
lst <- lapply(url, function(x) {
  pg <- read_html(x)
  body <- html_nodes(pg, "div#bodytext")
  es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
  nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
  nfpdates
})
head(lst, 3)
#[[1]]
# [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01"
# [9] "2008-09-05" "2008-10-03" "2008-11-07" "2008-12-05"
#[[2]]
# [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07"
# [9] "2009-09-04" "2009-10-02" "2009-11-06" "2009-12-04"
#[[3]]
# [1] "2010-01-08" "2010-02-05" "2010-03-05" "2010-04-02" "2010-05-07" "2010-06-04" "2010-07-02" "2010-08-06"
# [9] "2010-09-03" "2010-10-08" "2010-11-05" "2010-12-03"
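When reading several pages in a row, one failed request would stop the whole lapply call. A minimal sketch of guarding against that with tryCatch, assuming an empty Date vector is an acceptable result for a page that cannot be read. The names safe_fetch and fake_fetch are hypothetical, and fake_fetch is a stand-in that simulates a failure so the example runs offline:

```r
# Hypothetical wrapper: returns an empty Date vector if `fetch` fails for a URL.
safe_fetch <- function(url, fetch) {
  tryCatch(
    fetch(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      as.Date(character(0))
    }
  )
}

# Stand-in fetcher that fails for one year, to show the behaviour offline.
# In real use this would be the function that calls read_html() and so on.
fake_fetch <- function(url) {
  if (grepl("2009", url, fixed = TRUE)) stop("page not available")
  as.Date("2008-01-04")
}

url <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2009)
lst <- lapply(url, safe_fetch, fetch = fake_fetch)

# Number of dates recovered per URL; a failed page contributes zero.
lengths(lst)
```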