Scrape the contents of all div tags with a specific class

Question

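I am scraping a website and want all the text inside divs of one specific class. In the example below, I want to extract everything in the divs with class "a".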

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My desired output is:

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every div, but I don't know how to restrict it to class = "a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, it would look like site.find_all("div", class_="a").

Answer 1

DJa*_*ack 6

site %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@class="a"]') %>% 
  html_text()
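
Note that `//*` matches any element type, not only `div`, so this expression also picks up other tags that happen to share the class. The sketch below illustrates the difference under the assumption that the page contains a `span` with class "a"; the markup and the variable names `page`, `any_element`, and `div_only` are hypothetical and not from the question.

```r
library(rvest)

# Hypothetical markup (not from the question): a <span> also uses class "a"
page <- read_html("<div class='a'>Hello, world</div>
  <span class='a'>Not a div</span>
  <div class='b'>Good morning, world</div>")

# //* matches every element type whose class attribute equals "a"
any_element <- page %>%
  html_nodes(xpath = '//*[@class="a"]') %>%
  html_text()

# //div restricts the match to <div> elements only
div_only <- page %>%
  html_nodes(xpath = '//div[@class="a"]') %>%
  html_text()
```

For the question's sample input, which contains only divs, both expressions should return the same two strings.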

Answer 2

nei*_*fws 6

The CSS selector for a div with class = "a" is div.a:

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()

Alternatively, you can use XPath:

html_nodes(xpath = "//div[@class='a']")
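
The two forms are not strictly equivalent when an element carries more than one class. The CSS selector `div.a` matches any div whose class list contains "a", whereas `@class='a'` compares the whole attribute string. A minimal sketch, assuming hypothetical markup in which one div has the classes "a highlight" (the markup and variable names are illustrative, not from the question):

```r
library(rvest)

# Hypothetical markup (not from the question): the second div has two classes
page <- read_html("<div class='a'>Hello, world</div>
  <div class='a highlight'>Good afternoon, world</div>
  <div class='b'>Good morning, world</div>")

# CSS class selector: matches when "a" is one of the div's classes
css_match <- page %>%
  html_nodes("div.a") %>%
  html_text()

# XPath string comparison: matches only when the attribute is exactly "a"
exact_match <- page %>%
  html_nodes(xpath = "//div[@class='a']") %>%
  html_text()

# XPath token test that mimics the CSS class selector
token_match <- page %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()
```

On the question's own sample, where every div has a single class, all three return the same result, so the simpler forms are sufficient there.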
