tor*_*ino 6 html xml r web-scraping
我正在抓http://www.progarchives.com/album.asp?id=
一个警告信息:
警告消息:
XML内容似乎不是XML:http:
//www.progarchives.com/album.asp
? id = 2 http://www.progarchives.com/album.asp?id=3 http:// www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5
刮刀分别适用于每个页面,但不适用于网址b1=2:b2=1000
.
library(RCurl)
library(XML)
getUrls <- function(b1,b2){
root="http://www.progarchives.com/album.asp?id="
urls <- NULL
for (bandid in b1:b2){
urls <- c(urls,(paste(root,bandid,sep="")))
}
return(urls)
}
prog.arch.scraper <- function(url){
SOURCE <- getUrls(b1=2,b2=1000)
PARSED <- htmlParse(SOURCE)
album <- xpathSApply(PARSED,"//h1[1]",xmlValue)
date <- xpathSApply(PARSED,"//strong[1]",xmlValue)
band <- xpathSApply(PARSED,"//h2[1]",xmlValue)
return(c(band,album,date))
}
prog.arch.scraper(urls)
Run Code Online (Sandbox Code Playgroud)
下面是一个替代方法以rvest
和dplyr
:
library(rvest)
library(dplyr)
library(pbapply)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
pg <- html(sprintf(base_url, id))
data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
stringsAsFactors=FALSE)
}
albums <- bind_rows(pblapply(2:10, get_album_info))
head(albums)
## Source: local data frame [6 x 3]
##
## album date band
## 1 FOXTROT Studio Album, released in 1972 Genesis
## 2 NURSERY CRYME Studio Album, released in 1971 Genesis
## 3 GENESIS LIVE Live, released in 1973 Genesis
## 4 A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
## 6 GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
Run Code Online (Sandbox Code Playgroud)
我不想用大量的请求来阻止网站,所以要提高你的使用顺序.pblapply
为您提供免费进度条.
为了对网站友好(特别是因为它没有明确禁止抓取)你可能想Sys.sleep(10)
在get_album_info
函数的末尾抛出一个.
UPDATE
要处理服务器错误(在这种情况下500
,但它也适用于其他人),您可以使用try
:
library(rvest)
library(dplyr)
library(pbapply)
library(data.table)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
pg <- try(html(sprintf(base_url, id)), silent=TRUE)
if (inherits(pg, "try-error")) {
data.frame(album=character(0), date=character(0), band=character(0))
} else {
data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
stringsAsFactors=FALSE)
}
}
albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))
## album date band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
## 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
## 3: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
Run Code Online (Sandbox Code Playgroud)
您将不会获得错误页面的任何条目(在这种情况下,它只返回id 9,10和30的条目).
归档时间: |
|
查看次数: |
1063 次 |
最近记录: |