R: download.file() returns a 403 Forbidden error

Baz*_*zza 1 r web-scraping rvest

I have scraped this page before, but it now returns a 403 Forbidden error. When I visit the site manually in a browser there is no problem, yet when I scrape the page now I get the error.

The code is:

library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")

The error is:

Error in download.file(url, destfile = "webpage.html", quiet = TRUE) : 
  cannot open URL 'https://www.punters.com.au/form-guide/'
In addition: Warning message:
In download.file(url, destfile = "webpage.html", quiet = TRUE) :
  cannot open URL 'https://www.punters.com.au/form-guide/': HTTP status was '403 Forbidden'

I have looked through the documentation and tried to find an answer online, but no luck so far. Any suggestions on how I can get around this?

MeT*_*MeT 5

It looks like they have added user-agent validation. You need to add a user agent, and then it works.
If you don't send the user agent of a browser, the site assumes you are a bot and blocks you. Here is some Python code that demonstrates this:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.punters.com.au/form-guide/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(baseurl, headers=headers).content
soup = BeautifulSoup(page, 'html.parser')
title = soup.find("div", class_="short_title")
print("Title: " +title.text)

To make the request in R with a user agent:

require(httr)

# Send a browser User-Agent header so the server does not treat the request as a bot
headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

res <- httr::GET(url = 'https://www.punters.com.au/form-guide/', httr::add_headers(.headers=headers))
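If you would rather keep your original download.file()/read_html() workflow, a minimal sketch along the same lines is to override the HTTPUserAgent option, which download.file() uses for its User-Agent header. This assumes the site only checks the User-Agent and not, say, cookies:

library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
ua  <- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'

# Make download.file() send a browser User-Agent instead of the default "R (version ...)"
options(HTTPUserAgent = ua)

download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")

Alternatively, you can parse the body of the httr response above directly, e.g. html <- read_html(httr::content(res, as = "text")).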