Get the href attribute of each row in a table using rvest

Question

I am trying to extract all the links from a table similar to the following:

<!DOCTYPE html>
<html>
<body>

<table>
  <tr>
    <td>
      <a href="https://www.r-project.org/">R</a><br>
      <a href="https://www.rstudio.com/">RStudio</a>
    </td>
  </tr>
  <tr>
    <td>
      <a href="https://community.rstudio.com/">Rstudio Community</a>
    </td>
  </tr>
</table>

</body>
</html>

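What I want to end up with is a list of data frames (or vectors), where each element contains all the links from one row of the HTML table. In this case, the first vector in the list would be c("https://www.r-project.org/", "https://www.rstudio.com/") and the second would be c("https://community.rstudio.com/"). My main problem is that when I do the following, I lose the relationship between each href and the row (tr node) it came from: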

library(rvest)

web <- read_html("table.html") %>%
  html_nodes("table") %>%
  html_nodes("tr") %>%
  html_nodes("a") %>%
  html_attr("href")

Answer 1

And*_*tar 6

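One way is to run a second search that uses html_node instead of html_nodes for the "a" step, which returns only the first URL in each tr. You can then use that result to split the full list into groups.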

page <- read_html("table.html") #just read the html once

web <- page %>%
  html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
  html_attr("href") #as above

web2 <- page %>%
  html_nodes("table") %>% html_nodes("tr") %>% html_node("a") %>%
  html_attr("href") #just the first url in each tr

webdf <- data.frame(web=web, #full list
                    group=cumsum(web %in% web2), #grouping indicator by tr
                    stringsAsFactors=FALSE)

webdf
                             web group
1     https://www.r-project.org/     1
2       https://www.rstudio.com/     1
3 https://community.rstudio.com/     2
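
To turn this data frame into the list of vectors the question asks for, one option is base R's split(). The following is a minimal sketch: webdf is recreated literally from the printed output above so the snippet runs without table.html, and links_from_groups is a name introduced here for illustration only.

```r
# Recreate the data frame shown above (assumption: copied from the printed
# output rather than scraped, so no table.html is needed to run this).
webdf <- data.frame(
  web = c("https://www.r-project.org/",
          "https://www.rstudio.com/",
          "https://community.rstudio.com/"),
  group = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# split() partitions the URLs by the grouping column and returns a list
# with one character vector per table row (named by group value).
links_from_groups <- split(webdf$web, webdf$group)

str(links_from_groups)
```

The result is a list of length two: the first element holds the two links from the first row and the second element holds the single link from the second row.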

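Note that the grouping above relies on web %in% web2, so it can miscount if the first URL of one row also appears at a later position in any row. A possible alternative, not part of the original answer, is to select the tr nodes first and then query each row separately, which keeps the row relationship without comparing URLs. This is a sketch under stated assumptions: the question's table is inlined as a string (html) so that the example runs without table.html, and rows and links_by_row are illustrative names.

```r
library(rvest)

# Inline copy of the question's table (assumption: stands in for table.html).
html <- '<html><body><table>
  <tr><td>
    <a href="https://www.r-project.org/">R</a><br>
    <a href="https://www.rstudio.com/">RStudio</a>
  </td></tr>
  <tr><td>
    <a href="https://community.rstudio.com/">Rstudio Community</a>
  </td></tr>
</table></body></html>'

page <- read_html(html)

# One node per table row.
rows <- html_nodes(page, "table tr")

# For each row, collect the href of every link inside that row only.
# A row without links yields character(0).
links_by_row <- lapply(rows, function(row) {
  html_attr(html_nodes(row, "a"), "href")
})

str(links_by_row)
```

Because each row is searched on its own, duplicate URLs across rows do not affect the grouping, and the result is already the list of vectors requested in the question.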
