Baz*_*zza — tags: r, web-scraping, rvest
I have scraped this web page before, but it now returns a 403 Forbidden error. When I visit the site manually in a browser there is no problem, yet when I scrape the page now I get the error.
The code is:
library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")
The error is:
Error in download.file(url, destfile = "webpage.html", quiet = TRUE) :
cannot open URL 'https://www.punters.com.au/form-guide/'
In addition: Warning message:
In download.file(url, destfile = "webpage.html", quiet = TRUE) :
cannot open URL 'https://www.punters.com.au/form-guide/': HTTP status was '403 Forbidden'
I have checked the documentation and searched online for an answer, but no luck so far. Any suggestions on how I can get around this?
It looks like they have added user-agent verification. You need to send a user agent and then it works.
If you do not supply the user agent of some browser, the site assumes you are a bot and blocks you. Here is some Python code:
from bs4 import BeautifulSoup
import requests

baseurl = "https://www.punters.com.au/form-guide/"
# Pretend to be a regular browser so the site does not respond with 403
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}

page = requests.get(baseurl, headers=headers).content
soup = BeautifulSoup(page, 'html.parser')

title = soup.find("div", class_="short_title")
print("Title: " + title.text)
The same request in R with a user agent:
library(httr)

# Same trick: send a browser user agent so the request is not blocked
headers <- c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

res <- httr::GET(url = 'https://www.punters.com.au/form-guide/', httr::add_headers(.headers = headers))
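Since the question was using rvest, here is a minimal sketch of how the httr response above can be handed back to rvest for parsing. The "a" selector and the link extraction are only an illustration of the workflow, not something taken from the original post:

library(httr)
library(rvest)

headers <- c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

# Fetch the page with the browser user agent, then parse the body directly,
# so download.file() is no longer needed
res <- httr::GET(url = 'https://www.punters.com.au/form-guide/',
                 httr::add_headers(.headers = headers))
httr::status_code(res)   # should now be 200 rather than 403

html <- read_html(httr::content(res, as = "text", encoding = "UTF-8"))

# Example extraction only: grab all link targets on the form-guide page
links <- html_attr(html_elements(html, "a"), "href")
head(links)

If you would rather keep the original download.file() approach, recent versions of R (3.6.0 and later, if I remember correctly) also accept a headers argument, e.g. download.file(url, destfile = "webpage.html", quiet = TRUE, headers = c(`User-Agent` = "...")), which should have the same effect.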