R: Using the rvest package instead of the XML package to get links from a URL

cap*_*apm 11 xml r web-scraping rvest

I use the XML package to get the links from this URL.

library(XML)

# Parse the HTML at v1URL (the URL linked above)
v1WebParse <- htmlParse(v1URL)
# Read the links and get the company quotes from the href attributes
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

Although this method works well, I have been using rvest, and it seems to parse a web page faster than XML. I tried html_nodes and html_attrs, but I can't get it to work.

hrb*_*str 16

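Despite my comment, here is how you can do it with rvest. Note that we need to read the page in with htmlParse first, because the site sets the content type of that file to text/plain, and that throws rvest into a tizzy.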

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"  

This further shows rvest's XML package underpinnings.
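The question mentions html_attrs, while the code above uses html_attr. The two return different shapes, which may be part of why the original attempt did not work. Here is a minimal sketch of the difference, assuming a small made-up inline HTML snippet (reusing two hrefs from the output above) instead of the live page, and assuming an rvest version that provides read_html():

```r
library(rvest)

# Hypothetical inline snippet standing in for the real page
html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html" class="x">A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">B</a>
</body></html>'

pg <- read_html(html)
links <- html_nodes(pg, "a")

# html_attr(): one named attribute per node, returned as a character vector
hrefs <- html_attr(links, "href")

# html_attrs(): every attribute of every node, returned as a list of
# named character vectors (one list element per node)
all_attrs <- html_attrs(links)
```

For a plain vector of links, html_attr(links, "href") is usually the more convenient of the two.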

UPDATE

rvest::read_html() can now handle this directly:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
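The hrefs returned above are relative paths. If full URLs are needed, one option is xml2::url_absolute(). The sketch below is only an illustration: it parses a made-up inline snippet rather than fetching the live page (which may no longer be available), and the column name href is an arbitrary choice.

```r
library(rvest)
library(xml2)

# URL of the page the links came from, used as the base for resolution
base_url <- "http://www.bvl.com.pe/includes/empresas_todas.dat"

# Hypothetical inline stand-in for read_html(base_url)
pg <- read_html('<html><body>
  <a href="/inf_corporativa98959_ZNC.html">ZNC</a>
</body></html>')

# Extract the relative hrefs, then resolve them against the base URL
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")
full_urls <- url_absolute(hrefs, base_url)

# Same shape as the t1Links data frame in the question
t1Links <- data.frame(href = full_urls, stringsAsFactors = FALSE)
```

Against the real site, the inline snippet would be replaced by read_html(base_url).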