"Next page" function for rvest scraping

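I have added the final code I used at the bottom, in case anyone has a similar problem. I used the answer provided below, but added a few more nodes, a system sleep time (to avoid being kicked off by the server), and an if statement to prevent an error after the last valid page has been scraped.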

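I am trying to use a next-page function to pull multiple pages from a website. I created a data frame with a next-page variable (nexturl) and filled its first value with the starting URL.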

#building dataframe with variables
bframe <- data.frame(matrix(ncol = 3, nrow = 10000))
x <- c("curpage", "nexturl", "posttext")
colnames(bframe) <- x

#assigning first value for nexturl
bframe$nexturl[[1]] <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"


I want to extract the text as follows (I know the code is clumsy, since I'm brand new to this, but it does get what I want):

library(rvest)

## create html object
blogfunc <- read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")
## create object with post content scraped
posttext <- blogfunc %>%
    html_nodes(".article-content") %>%
    html_text()
## strip bell, tab, and newline characters from the scraped text (not from blogfunc)
posttext <- gsub('[\a]', '', posttext)
posttext <- gsub('[\t]', '', posttext)
posttext <- gsub('[\n]', '', posttext)
##scrape next url
nexturl <-  blogfunc    %>% 
    html_nodes(".prev-post-link-wrap a") %>% …
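
The approach described above (follow the next-page link, pause between requests, and stop cleanly after the last valid page) could be sketched roughly as follows. This is only an illustration, not the final code from the post: scrape_pages, max_pages, delay, and fetch are hypothetical names, and reading the link target with html_attr("href") is an assumption, since the original snippet is cut off at that point.

```r
library(rvest)

# Hypothetical helper: follow "next page" links until none is left.
# `fetch` is injectable so the loop can be tried without network access.
scrape_pages <- function(start_url, max_pages = 100, delay = 1,
                         fetch = read_html) {
  urls  <- character(0)
  texts <- character(0)
  url   <- start_url

  while (length(urls) < max_pages && !is.na(url)) {
    page <- fetch(url)

    # Post body: join all matching nodes, then drop bell, tab, newline chars.
    body <- page %>% html_nodes(".article-content") %>% html_text()
    body <- gsub("[\a\t\n]", "", paste(body, collapse = " "))

    urls  <- c(urls, url)
    texts <- c(texts, body)

    # Assumption: the next URL is the href of the first matching link.
    # On the last page nothing matches, so the loop ends instead of erroring.
    link <- page %>% html_nodes(".prev-post-link-wrap a") %>% html_attr("href")
    url  <- if (length(link) > 0) link[[1]] else NA_character_

    if (!is.na(url)) Sys.sleep(delay)  # pause between requests
  }

  data.frame(curpage = urls, posttext = texts, stringsAsFactors = FALSE)
}
```

Calling scrape_pages() with the starting URL from the question would return one row per visited page; growing vectors inside the loop avoids pre-allocating a 10000-row data frame, at the cost of some copying on long crawls.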


r scrape rvest

Bri*_*her

2017-04-09

2 votes · 1 answer · 2248 views
