mat*_*sho 6 r reddit text-mining web-scraping
我有功能在2014-11-01和2015-10-31之间刮掉比特币subreddit中的所有帖子.
但是,我只能提取大约990个帖子,这些帖子可以追溯到10月25日.我不明白发生了什么.在参考https://github.com/reddit/reddit/wiki/API之后,我在每个提取之间包含了一个15秒的Sys.sleep ,但无济于事.
此外,我尝试从另一个subreddit(健身)刮,但它也返回约900个帖子.
require(jsonlite)
require(dplyr)
getAllPosts <- function() {
url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
extract <- fromJSON(url)
posts <- extract$data$children$data %>% dplyr::select(name, author, num_comments, created_utc,
title, selftext)
after <- posts[nrow(posts),1]
url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
extract.next <- fromJSON(url.next)
posts.next <- extract.next$data$children$data
# execute while loop as long as there are any rows in the data frame
while (!is.null(nrow(posts.next))) {
posts.next <- posts.next %>% dplyr::select(name, author, num_comments, created_utc,
title, selftext)
posts <- rbind(posts, posts.next)
after <- posts[nrow(posts),1]
url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
Sys.sleep(15)
extract <- fromJSON(url.next)
posts.next <- extract$data$children$data
}
posts$created_utc <- as.POSIXct(posts$created_utc, origin="1970-01-01")
return(posts)
}
posts <- getAllPosts()
Run Code Online (Sandbox Code Playgroud)
reddit是否有某种限制,我正在打?