Web scraping between two tags

sta*_*oob 0 html r web-scraping

I am trying to scrape the following page:

http://mywebsite.com

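In particular, I want to get the name of each entry. I noticed that the text I am interested in (MY TEXT) is always located between these two tags: <div class="title"> <a href="your text"> MY TEXT </a>

I know how to search for each of these tags individually: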

#load libraries 
library(rvest)
library(httr)
library(XML)

# set up page
url<-"https://www.mywebsite.com"
page <-read_html(url)

#option 1
b = page %>% html_nodes("title")

option1 <- b %>% html_text() %>% strsplit("\\n")

#option 2
b = page %>% html_nodes("a")

option2 <- b %>% html_text() %>% strsplit("\\n")

Is there a way to specify the "html_nodes" argument so that it picks up "MY TEXT", i.e. scrapes what lies between <div class="title"> and </a>?

 <div class="title"> <a href="your text"> MY TEXT </a>

Hoe*_*elR 5

Scraping pages 1 to 10:

library(tidyverse)
library(rvest)

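my_function <- function(page_n) {
  
  # report progress
  cat("Scraping page ", page_n, "\n")
  
  # build the URL for this results page and parse it
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
    page_n, "?category=0&services=0&province=55&city=&k=") %>% read_html()
  
  # ".title a" selects every <a> nested inside an element with class "title"
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         address = page %>%  
           html_elements(".marker") %>% 
           html_text2(),
         page = page_n)
}

# run the function for pages 1 to 10 and row-bind the results
df <- map_dfr(1:10, my_function)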
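To see why the ".title a" selector answers the question, here is a minimal offline sketch. The HTML fragment, the href values and the entry names are made up for illustration and only mimic the <div class="title"> <a> structure from the question. It also shows that html_elements("title") matches <title> tags, not elements with class "title", which is likely why option 1 in the question does not return the entries.

```r
library(rvest)

# Hypothetical fragment mimicking the structure described in the question
html <- minimal_html('
  <div class="title"><a href="/entry-1">First entry</a></div>
  <div class="title"><a href="/entry-2">Second entry</a></div>
  <div class="other"><a href="/ignored">Not a title</a></div>
')

# "title" is an element selector: it looks for <title> tags, not class="title"
title_tags <- html %>% html_elements("title")

# "div.title > a" selects <a> tags that are direct children of <div class="title">
links  <- html %>% html_elements("div.title > a")
titles <- links %>% html_text2()        # the text between <a ...> and </a>
hrefs  <- links %>% html_attr("href")   # the link target, if it is needed too
```

Here `titles` holds the two entry names and the link inside the "other" div is skipped. On a real page the selector may need adjusting (for example ".title a" if the link is nested more deeply), so it is worth checking the page structure in the browser's inspector first.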