Extract the href attribute or convert nodes to a list of character strings

Question

I am trying to extract some information from a website:

library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- html(url)

nodes <- html_nodes(html, ".listItemSolr") 
nodes

I get a "list" of 30 HTML snippets. From each element of that list I want to extract the last href attribute, so for the 30th element it would be

<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobot? prezentacja has?a i programu wyborczego Komorowskiego">

So I want to get the string

"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"

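The problem is that html_attr(nodes, "href") does not work (I get a vector of NAs). So I thought about using regular expressions, but the problem is that nodes is not a list of character strings.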

class(nodes)
[1] "XMLNodeSet"

I tried

xmlToList(nodes)

but that does not work either.

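So my question is: how can I extract this URL using a function designed for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a list of character strings?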

Answer 1

ber*_*ant 8

Try searching within the children of the nodes:

nodes <- html_nodes(html, ".listItemSolr") 

sapply(html_children(nodes), function(x){
  html_attr( x$a, "href")
})

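To see why searching the children helps, here is a minimal sketch on made-up markup. It assumes a version of rvest that provides read_html() (the html() call in the question was later deprecated in its favour); the example.com URLs and the exact markup are hypothetical stand-ins for the real page. The .listItemSolr containers carry no href of their own, so html_attr() returns NA for each of them, while the <a> elements inside them do carry one.

```r
library(rvest)

# Hypothetical markup standing in for the real page (an assumption)
page <- read_html('
  <div class="listItemSolr"><a href="http://example.com/a" title="First">First</a></div>
  <div class="listItemSolr"><a href="http://example.com/b" title="Second">Second</a></div>
')

items <- html_nodes(page, ".listItemSolr")

# The containers themselves have no href attribute, so this is all NA
container_href <- html_attr(items, "href")

# Selecting the <a> children first yields the URLs
links <- items %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")
```

Here container_href holds one NA per item, and links holds the two example URLs.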

Update

Hadley suggested an elegant pipeline:

html %>%  
  html_nodes(".listItemSolr") %>% 
  html_nodes(xpath = "./a") %>% 
  html_attr("href")

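For readers on a newer rvest, the same idea might look like the sketch below. It assumes rvest 1.0.0 or later, where html_elements() and html_element() are available, and it again uses hypothetical markup and example.com URLs rather than the real page. The XPath "(.//a)[last()]" is one possible way to pick the last link inside each item, as the question asks, and as.character() covers the second half of the question by serialising each node to an HTML string.

```r
library(rvest)  # assumes rvest >= 1.0.0 for html_elements() / html_element()

# Hypothetical markup; the structure of the real page may differ
page <- read_html('
  <div class="listItemSolr">
    <a href="http://example.com/section">Section</a>
    <a href="http://example.com/article-1" title="Article 1">Article 1</a>
  </div>
  <div class="listItemSolr">
    <a href="http://example.com/article-2" title="Article 2">Article 2</a>
  </div>
')

items <- html_elements(page, ".listItemSolr")

# Last <a> inside each item, whatever its depth; one result per item
last_links <- items %>%
  html_element(xpath = "(.//a)[last()]") %>%
  html_attr("href")

# Each node serialised as an HTML string, one string per item
item_html <- as.character(items)
```

html_element() returns exactly one result per input node (a missing node where nothing matches), so last_links stays aligned with items even if some item has no link.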
