sta*_*oob 0 html r web-scraping
I am trying to scrape the following page:
http://mywebsite.com
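In particular, I want to get the name of each entry. I noticed that the text I am interested in (MY TEXT) always sits between these two tags: <div class="title"> <a href="your text"> MY TEXT </a>
I know how to search for each of these tags on its own: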
# load libraries
library(rvest)
library(httr)
library(XML)
# set up page
url <- "https://www.mywebsite.com"
page <- read_html(url)
# option 1: the selector "title" matches <title> elements, not class="title"
b <- page %>% html_nodes("title")
option1 <- b %>% html_text() %>% strsplit("\\n")
# option 2: matches every <a> element on the page
b <- page %>% html_nodes("a")
option2 <- b %>% html_text() %>% strsplit("\\n")
Is there a way to specify the "html_nodes" argument so that it picks up "MY TEXT", i.e. scrapes the text between <div class="title"> and </a>:
<div class="title"> <a href="your text"> MY TEXT </a>
Scraping pages 1 to 10:
library(tidyverse)
library(rvest)

# Scrape one results page and return its entries as a tibble
my_function <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
                 page_n, "?category=0&services=0&province=55&city=&k=") %>%
    read_html()
  tibble(
    # text of each <a> nested inside an element with class "title"
    title = page %>%
      html_elements(".title a") %>%
      html_text2(),
    address = page %>%
      html_elements(".marker") %>%
      html_text2(),
    page = page_n
  )
}

# Run the function for pages 1 to 10 and row-bind the results
df <- map_dfr(1:10, my_function)
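To see why the selector matters, here is a minimal offline sketch on a made-up HTML fragment (the entry names and `href` values are hypothetical, chosen only to mimic the structure described in the question). A bare `"title"` is a type selector and only matches `<title>` elements, whereas `".title a"` matches `<a>` elements nested inside any element with class `title`.

```r
library(rvest)

# Hypothetical fragment mimicking the markup described in the question
doc <- read_html('
  <div class="title"><a href="/entry-1">First entry</a></div>
  <div class="title"><a href="/entry-2">Second entry</a></div>
  <div class="other"><a href="/skip">Not a title</a></div>
')

# "title" without a leading dot looks for <title> elements; none exist here
n_title_tags <- length(html_elements(doc, "title"))

# ".title a" selects <a> elements inside elements with class "title"
links  <- html_elements(doc, ".title a")
titles <- html_text2(links)         # visible text of each link
hrefs  <- html_attr(links, "href")  # link targets, if they are needed too

print(n_title_tags)
print(titles)
print(hrefs)
```

If the links must be direct children of the `div`, `"div.title > a"` is a stricter alternative. On a real page the class names may differ, so it is worth checking the markup in the browser's inspector first.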