Scraping a document in R

Question

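I am trying to download a Word document from the following web page. When you press the button, the Word document downloads automatically, without any download link being shown.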

I am now trying to download this document in R using XPath.

library(rvest)

# send an HTTP GET request to the URL
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
page <- read_html(url)

# locate the link to the Word document using an XPath expression
doc_link <- page %>%
  html_nodes(xpath = '//*[@id="action_word_export"]') %>%
  html_attr("href")

Unfortunately, this does not work and nothing is downloaded. Can someone help me solve this and download the Word document from within R?

Answer 1

Rus*_*uss 5

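The problem is that the button triggers a JavaScript script that actually sends the download request, so there is no href attribute directly associated with the button. If you are willing to use RSelenium, you can download the file as follows: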

# load libraries
library(RSelenium)


# define target url
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"


# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)


# click on the download button ------------------------------------
remDr$findElement(using = "xpath",value = '//*[@id="action_word_export"]')$clickElement()

The file should be downloaded to your default downloads folder.
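Once the download has finished, it is usually worth shutting the session down so that the port (4550 above) is free for the next run. A minimal sketch, assuming `rD` is the list returned by `rsDriver()` as in the code above; the helper name `stop_selenium` is made up for this example:

```r
# Close the browser session and stop the Selenium server started by rsDriver().
# `rD` is assumed to be the list returned by rsDriver(), with elements
# "client" and "server". `stop_selenium` is a hypothetical helper name.
stop_selenium <- function(rD) {
  # close the browser window; ignore errors if it is already closed
  try(rD[["client"]]$close(), silent = TRUE)
  # stop the background server process
  rD[["server"]]$stop()
  invisible(NULL)
}

# Usage, after the file has arrived in the downloads folder:
# stop_selenium(rD)
```

The click only starts the download, so you may need to wait a few seconds (for example with `Sys.sleep()`) before closing the browser.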

Their download links may also follow a standard format. You can use your browser's web developer tools to see the URL that the JavaScript points to.

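If you append that part (exportTax.html?taxId=4205&taxVersionDate=1672527600) to the main URL, you end up with a link that also downloads the file: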

download_link <- paste0("https://ec.europa.eu/taxation_customs/tedb/",
                        "exportTax.html?taxId=4205&taxVersionDate=1672527600")

https://ec.europa.eu/taxation_customs/tedb/exportTax.html?taxId=4205&taxVersionDate=1672527600
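With a direct link like this, base R's `download.file()` may be enough, with no browser involved. The following is only a sketch: the helper name `save_export` and the output file name (including the `.docx` extension) are assumptions, and it has not been checked whether the server accepts requests that come without a browser session.

```r
# Download a binary file to disk and return the destination path invisibly.
# `save_export` is a hypothetical helper name for this example.
save_export <- function(link, destfile) {
  # mode = "wb" writes the file in binary mode, which matters on Windows
  status <- download.file(link, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}

download_link <- paste0("https://ec.europa.eu/taxation_customs/tedb/",
                        "exportTax.html?taxId=4205&taxVersionDate=1672527600")

# Only contact the server when run interactively.
# The file name and extension are guesses; adjust them to what the server sends.
if (interactive()) {
  save_export(download_link, "tedb_export.docx")
}
```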

There may be a pattern that lets you paste your search criteria together to build the download link, instead of using RSelenium.
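Judging from the single example above, the two numbers in the page's `id` parameter appear to map to `taxId` and `taxVersionDate`. A minimal sketch of that mapping, assuming the pattern holds for other pages as well (this has not been verified); `build_export_url` is a made-up helper name:

```r
# Turn a taxDetails page URL into the matching exportTax URL.
# Assumes the id parameter always looks like "<taxId>/<taxVersionDate>",
# which is inferred from one example only.
build_export_url <- function(page_url) {
  m <- regmatches(page_url, regexec("[?&]id=([0-9]+)/([0-9]+)", page_url))[[1]]
  if (length(m) != 3) {
    stop("could not find id=<taxId>/<taxVersionDate> in: ", page_url)
  }
  paste0("https://ec.europa.eu/taxation_customs/tedb/exportTax.html",
         "?taxId=", m[2], "&taxVersionDate=", m[3])
}

page_url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
export_url <- build_export_url(page_url)
print(export_url)
```

The resulting URL can then be passed to `download.file()` with `mode = "wb"`.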
