How to read and parse the content of a web page in R

Mark · 12 · Tags: html, screen-scraping, r, html-content-extraction

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do that?

Shane · 32

I'm not sure how you want to process that page, because it's really messy. As we re-learned in that famous stackoverflow question, doing regex on html is not a good idea, so you will definitely want to parse it with the XML package.

Here's an example to get you started:

require(RCurl)
require(XML)
# fetch the raw HTML as a single string
webpage <- getURL("http://www.haaretz.com/")
# convert the single string into a character vector of lines
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# parse into an HTML tree, silently swallowing malformed-markup errors
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))                                     # split on newlines
x <- gsub("\t", "", x)                                             # drop tabs
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)   # trim surrounding whitespace
x <- x[!(x %in% c("", "|"))]                                       # drop empty strings and separators

This yields a character vector of mostly just the web page text (along with some javascript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
[4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
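
A follow-up sketch, not part of the original answer: if you want something more targeted than dumping all the table text, you can point XPath at specific nodes instead. Reusing the pagetree object from above, this pulls the text and href of every link on the page, using only functions from the XML package already loaded:

# reuse the parsed tree from above; collect each anchor's text and href
links <- xpathSApply(pagetree, "//a[@href]", function(node)
    c(text = xmlValue(node), href = xmlGetAttr(node, "href")))
# xpathSApply simplifies the result to a 2-row matrix; transpose it
links <- t(links)
head(links)

The same pattern works for any other node set (headlines, images, etc.) by swapping in a different XPath expression.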