Tags: youtube, xpath, web-scraping
I'm extracting user comments from a range of websites (such as reddit.com), and YouTube is another rich source of information for me. My existing scraper is written in R:
# x is the URL of the page to scrape
library(RCurl)
library(XML)
html <- getURL(x)
doc  <- htmlParse(html, asText = TRUE)
txt  <- xpathSApply(doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)
This doesn't work for YouTube data: if you look at the page source of a YouTube video like this one, you'll find that the comments do not appear in the source at all, because they are loaded dynamically by JavaScript.
Does anyone have a suggestion for how to extract the data in this situation?
Thanks a lot!
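To see why the XPath in the question works for static pages (and why it can't help with content injected by JavaScript), here is a self-contained sketch against an inline HTML string; the snippet and its class name are made up for illustration:

```r
library(XML)

# A minimal static page: the comment text is present in the raw HTML,
# while the <script> body must be filtered out by the XPath predicates.
html <- '<html><body>
  <div class="comment">great video</div>
  <script>var hidden = "should not appear";</script>
</body></html>'

doc <- htmlParse(html, asText = TRUE)
txt <- xpathSApply(doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)

# drop the pure-whitespace text nodes left over between tags
txt <- txt[nzchar(trimws(txt))]
txt <- trimws(txt)
txt
# [1] "great video"
```

Anything a script would insert after page load simply never appears among these text nodes, which is exactly the problem with YouTube comments.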
Following the answer to R: rvest: Scraping a dynamic ecommerce page, you can do the following:
devtools::install_github("ropensci/RSelenium") # install the development version from GitHub
library(RSelenium)
library(rvest)

pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # path needed as I am on Windows
Sys.sleep(5) # give the phantomjs binary a moment to start
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"

# scroll down so that more comments get loaded
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page source and parse it via rvest
page_source <- remDr$getPageSource()
doc    <- read_html(page_source[[1]])
author <- doc %>% html_nodes(".user-name") %>% html_text()
text   <- doc %>% html_nodes(".comment-text-content") %>% html_text()

# combine the data in a data.frame
dat <- data.frame(author = author, text = text)
Result:
> head(dat)
author text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2 Tatjana Celinska Ciao 0
3  Yvette Austin GET OUT OF MY HEAD!!!!
4 Susan II Watch narhwals
5 Greg Ginger who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6 Arnav Sinha LOL what the hell is this?
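If you want to sanity-check the rvest selector step without a live browser session, you can run the same selectors against an inline HTML snippet. The class names (.user-name, .comment-text-content) are taken from the answer above and are assumptions about YouTube's old markup; the names and comments here are invented:

```r
library(rvest)

# Hypothetical snippet mimicking the class names used in the answer
snippet <- '<div>
  <div class="comment">
    <span class="user-name">alice</span>
    <p class="comment-text-content">nice video</p>
  </div>
  <div class="comment">
    <span class="user-name">bob</span>
    <p class="comment-text-content">thanks!</p>
  </div>
</div>'

doc    <- read_html(snippet)
author <- doc %>% html_nodes(".user-name") %>% html_text()
text   <- doc %>% html_nodes(".comment-text-content") %>% html_text()
dat    <- data.frame(author = author, text = text,
                     stringsAsFactors = FALSE)
dat
#   author       text
# 1  alice nice video
# 2    bob    thanks!
```

Note that this pairing by position only works while every comment has both a user name and a text node; a missing node would silently misalign the two vectors.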
Comment 1: You do indeed need the GitHub version, see rselenium | get youtube page source
Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that has to be clicked, and to see further comments you have to click the "Show more" button at the bottom of the page. Clicking is covered in the excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
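The clicking that Comment 2 mentions could be sketched with RSelenium's findElements/clickElement methods. This is only a sketch: it assumes `remDr` is the open session from the answer above, and both CSS selectors are hypothetical placeholders, since YouTube's real button markup has changed repeatedly:

```r
library(RSelenium)

# Click every element matching `selector`, re-querying after each click
# because the DOM changes as new comments load. `max_clicks` guards
# against looping forever on a button that never disappears.
click_all <- function(remDr, selector, max_clicks = 20, pause = 2) {
  for (i in seq_len(max_clicks)) {
    btns <- remDr$findElements(using = "css selector", value = selector)
    if (length(btns) == 0) return(invisible(i - 1))
    # a button can go stale once the DOM updates, so ignore click errors
    try(btns[[1]]$clickElement(), silent = TRUE)
    Sys.sleep(pause) # give the new comments time to load
  }
  invisible(max_clicks)
}

# Hypothetical selectors -- inspect the live page to find the real ones:
# click_all(remDr, ".view-all-replies")  # expand "show all answers" links
# click_all(remDr, ".load-more-button")  # load more top-level comments
```

After the clicks you would re-run remDr$getPageSource() and the rvest extraction from the answer to pick up the newly loaded comments.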
Views: 5564