Extracting data from an HTML page using R

Par*_*han 1 · tags: html, xml, r, readlines, web-scraping

I am trying to extract data from the following site:

https://www.zomato.com/ncr/restaurants/north-indian

using R. I am a learner and a beginner in this field!

Here is what I tried:

> library(XML)

> doc<-htmlParse("the url mentioned above")

> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian' 

That was one attempt... I also tried readLines(), with the following output:

> readLines("the URL as mentioned above") [I can't specify more than two links, so typing this]

> Error in file(con, "r") : cannot open the connection

> In addition: Warning message:

> In file(con, "r") : unsupported URL scheme

I know the page is not XML, as the error says, but is there any other way I can capture data from this site? I did try using tidy html to convert it to XML or XHTML and then process it, but I got nowhere; maybe I don't yet know the actual process for using tidy html. :( Not sure! Any suggestions for solving this, and corrections if I'm doing something wrong?
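For context, the htmlParse() warning above appears because the XML package does not fetch https URLs itself, so it treats the URL string as literal content. If the raw HTML is downloaded first (RCurl::getURL() or httr are common options), it can be parsed with asText = TRUE. A minimal sketch, using a stand-in HTML string rather than a live download:

```r
library(XML)

# htmlParse() cannot download https URLs on its own; fetch the page
# text first, e.g.:
#   page_text <- RCurl::getURL("https://www.zomato.com/ncr/restaurants/north-indian")
# Here a small stand-in string is used instead of a live request:
page_text <- "<html><body><a class='result-title'>Bukhara</a></body></html>"

# asText = TRUE tells htmlParse() the argument is HTML content, not a file/URL
doc <- htmlParse(page_text, asText = TRUE)

# XPath queries then work as usual on the parsed document
xpathSApply(doc, "//a[@class='result-title']", xmlValue)
## [1] "Bukhara"
```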

hrb*_*str 5

The rvest package is also very handy (and is built on top of the XML package, among others):

library(rvest)

pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")  # read_html() in current rvest; older versions used html()

# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()

##  [1] "Bukhara - ITC Maurya "                "Karim's "                            
##  [3] "Gulati "                              "Dhaba By Claridges "                 
## ...
## [27] "Dum-Pukht - ITC Maurya "              "Maal Gaadi "                         
## [29] "Sahib Sindh Sultan "                  "My Bar & Restaurant "                

# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)

##  [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
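Since the two vectors line up row for row, a natural next step is to combine them into a single data frame. A sketch using hard-coded sample values copied from the output above (in practice these would be the html_text() results from the two pipelines); trimws() removes the trailing spaces in the names:

```r
# sample values standing in for the scraped vectors above
names_raw <- c("Bukhara - ITC Maurya ", "Karim's ", "Gulati ")
ratings   <- c("4.3", "4.1", "4.2")

restaurants <- data.frame(
  name   = trimws(names_raw),    # drop trailing whitespace from the names
  rating = as.numeric(ratings),  # ratings as numbers, not strings
  stringsAsFactors = FALSE
)
```

Having the ratings as numeric makes it easy to sort or filter, e.g. `restaurants[order(-restaurants$rating), ]`.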