cap*_*apm 11 xml r web-scraping rvest
I am using the XML package to get the links from this URL.
library(XML)
# Parse the HTML at v1URL (v1URL holds the address of the page to scrape)
v1WebParse <- htmlParse(v1URL)
# Read the links and get the quotes of the companies from the href attribute
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))
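This approach works well, but I have also been using rvest, and it seems faster than XML at parsing web pages. I tried html_nodes and html_attrs, but I can't get it to work.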
hrb*_*str 16
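Despite my comment, here is how you can do this with rvest. Note that we need to read the page in with htmlParse first, because the site sets the content type of that file to text/plain, which sends rvest into a tizzy.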
library(rvest)
library(XML)
pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
## [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
## [3] "/inf_corporativa66100_ACESEGC1.html" "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html" "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"
This further illustrates rvest's underpinnings in the XML package.
UPDATE
rvest::read_html() can now handle this directly:
pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
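The question mentions html_attrs, while the answer uses html_attr, and the difference matters here. As a minimal offline sketch (the inline HTML string and the class attribute are made up for illustration and are not the real page), html_attr("href") should give one value per link, whereas html_attrs() gives every attribute of every node as a list:

```r
library(rvest)

# Hypothetical stand-in for the downloaded page, so no network access is needed.
# The class attribute is invented to show a node with more than one attribute.
html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html" class="x">A</a>
  <a href="/inf_corporativa98959_ZNC.html">B</a>
</body></html>'

# read_html() also accepts a literal HTML string
pg <- read_html(html)

# html_attr() extracts a single named attribute: a character vector of hrefs
links <- pg %>% html_nodes("a") %>% html_attr("href")

# html_attrs() extracts all attributes: a list of named character vectors
all_attrs <- pg %>% html_nodes("a") %>% html_attrs()

print(links)
print(all_attrs)
```

If only the links are needed, html_attr("href") is probably the more direct equivalent of xmlGetAttr(..., 'href') from the question.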