Extracting meta descriptions from web pages using R

Bla*_*las 2 r httr rvest

Hello, I am trying to retrieve the meta descriptions of these web pages from the page source:

Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html", 
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))

Desired output:

Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook", 
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))

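I tried to do this with httr, but I can't get the result into the desired output format, or extract the description from the content retrieved by the GET call: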

library(httr)
resp <- GET("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
# List of 10
#  $ url        : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
#  $ status_code: int 200
#  $ headers    :List of 22
#   ..$ server                     : chr "Apache/2.2"

The field I need to extract from the page source comes right after this string:

<meta itemprop="description" content="

Like this:

<meta itemprop="description" content="&#039;Spam King&#039; 
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages" 

ali*_*ire 6

You really only need rvest. Since the descriptions are all <h1> headings, you can iterate over the list of URLs and select the heading:

library(rvest)

sapply(Data$Pages, 
       function(url){
           url %>% 
               as.character() %>%   # in case strings are stored as factors
               read_html() %>% 
               html_nodes('h1') %>% 
               html_text()
           })

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting" 

Or, if you really want to scrape the <meta> tag, you can do it the same way, though the selector is a bit more painful:

sapply(Data$Pages, function(url){
    url %>% 
        as.character() %>% 
        read_html() %>% 
        html_nodes(xpath = '//meta[@itemprop="description"]') %>% 
        html_attr('content')
    })
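The <meta> selector can also be checked without hitting the network by parsing an inline copy of the tag. This is only a minimal sketch, assuming rvest is installed. The `snippet` string is a hand-made stand-in trimmed from the question, not the real page source, and the CSS attribute selector is an alternative that is not part of the original answer. Note that the parser decodes the `&#039;` entities into plain apostrophes.

```r
library(rvest)

# Hand-made stand-in for the page source (hypothetical, based on the tag in the question)
snippet <- '<html><head>
<meta itemprop="description" content="&#039;Spam King&#039; Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages">
</head><body><h1>placeholder</h1></body></html>'

page <- read_html(snippet)

# XPath form of the selector
via_xpath <- page %>%
    html_nodes(xpath = '//meta[@itemprop="description"]') %>%
    html_attr('content')

# CSS attribute selector that should match the same node
via_css <- page %>%
    html_nodes('meta[itemprop="description"]') %>%
    html_attr('content')

# Both selectors should return the same single string
identical(via_xpath, via_css)
```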

Either approach, the <h1> heading or the <meta> tag, gets you the same result.
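To get the descriptions into a data frame column, as the question's desired output asks, one option is to wrap the extraction in a small helper and use `vapply`. This is only a sketch under assumptions: `get_meta_description` is a hypothetical helper name, and the two inline HTML strings are made-up stand-ins so the example runs offline. With the real data you would pass the URLs in `Data$Pages` instead. Using `html_node` (singular) means a page without the tag should yield `NA` rather than a zero-length result.

```r
library(rvest)

# Hypothetical helper: returns the description, or NA if the tag is absent
get_meta_description <- function(src) {
    src %>%
        as.character() %>%   # in case strings are stored as factors
        read_html() %>%
        html_node(xpath = '//meta[@itemprop="description"]') %>%
        html_attr('content')
}

# Made-up inline pages standing in for the URLs (not real page source)
pages <- c(
    '<html><head><meta itemprop="description" content="First description"></head><body></body></html>',
    '<html><head><title>No description here</title></head><body></body></html>'
)

Data <- data.frame(Pages = pages, stringsAsFactors = FALSE)

# vapply guarantees exactly one character value per page
Data$Meta_Description <- vapply(Data$Pages, get_meta_description,
                                character(1), USE.NAMES = FALSE)
```

Unlike `sapply`, `vapply` fails loudly if any page returns something other than a single string, which keeps the new column a plain character vector.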