rvest error in open.connection(x, "rb"): Timeout was reached

Question

I am trying to scrape content from http://google.com, but I get an error message.

library(rvest)  
html("http://google.com")

Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Because I am on a company network, this may be caused by a firewall or proxy. I tried using set_config, but it did not work.

Answer 1

I ran into the same problem, Error in open.connection(x, "rb") : Timeout was reached, when working behind a proxy on an office network.

This worked for me:

library(rvest)
url <- "http://google.com"
# Save the page to a local file first, then parse the saved copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")
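
The download-then-parse approach above can be wrapped in a small helper. This is only a sketch: `fetch_to_file` and its `timeout_sec` argument are hypothetical names, and raising the timeout is an assumption about what might help on a slow network. It does not address the proxy itself.

```r
# Download a page to a temporary file and return the file path.
# `fetch_to_file` and `timeout_sec` are hypothetical names used for illustration.
fetch_to_file <- function(url, timeout_sec = 120) {
  # download.file() honours getOption("timeout"); raise it for this call only
  old <- options(timeout = timeout_sec)
  on.exit(options(old), add = TRUE)

  dest <- tempfile(fileext = ".html")
  status <- tryCatch(
    download.file(url, destfile = dest, quiet = TRUE),
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      1L  # treat any error as a non-zero status
    }
  )

  # download.file() returns 0 on success
  if (!identical(as.integer(status), 0L)) return(NULL)
  dest
}
```

With rvest loaded, the result can be parsed with `read_html(fetch_to_file("http://google.com"))`, after checking that the helper did not return `NULL`.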

Answer 2

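This is probably because your call to read_html (or html, in your case) does not properly identify itself to the server it is trying to retrieve content from, which is the default behaviour. Using curl, add a user agent to the handle argument passed to read_html so that your scraper identifies itself.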

library(rvest)
library(curl)
# Pass a curl handle with a user agent set, so the request identifies itself
read_html(curl("http://google.com", handle = curl::new_handle("useragent" = "Mozilla/5.0")))
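
If more than the user agent needs to be set, the handle can be built separately. The sketch below is not part of the original answer: `make_handle`, its default values, and the `proxy` argument are hypothetical, and whether an explicit timeout or proxy setting helps depends on the network in question.

```r
library(curl)

# Build a curl handle that identifies the client and bounds the wait time.
# `make_handle`, its defaults, and the `proxy` argument are hypothetical.
make_handle <- function(user_agent = "Mozilla/5.0", timeout_sec = 30, proxy = NULL) {
  # useragent and timeout are libcurl options; timeout is in seconds
  h <- new_handle(useragent = user_agent, timeout = timeout_sec)
  if (!is.null(proxy)) {
    # Only set a proxy when one is supplied (the address comes from the caller)
    handle_setopt(h, proxy = proxy)
  }
  h
}
```

The handle is then used the same way as in the answer: `read_html(curl("http://google.com", handle = make_handle()))`.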