HTTP Error 403 in Python 3 Web Scraping

Question

Jos*_*osh 79 python http http-status-code-403 web

I am trying to scrape a website for practice, but I keep getting HTTP Error 403 (does it think I'm a bot?).

Here is my code:

#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.findall(findlink, webpage)

print(len(row_array))

iterator = []

The error I get is:

 File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Answer 1

Ste*_*ppo 158

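This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easily detected). Try setting a known browser user agent with: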

from urllib.request import Request, urlopen

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

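This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I think that's a typo.

TIP: since this is for practice, choose a different, less restrictive site. Maybe they are blocking urllib for some reason...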
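To see what the server sees, it can help to compare urllib's default identification with the header the answer sets. The following is a minimal sketch; the helper name `build_request` is hypothetical, and nothing here touches the network.

```python
from urllib.request import Request, build_opener


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a browser-like User-Agent header.

    `build_request` is an illustrative helper, not part of urllib.
    """
    return Request(url, headers={"User-Agent": user_agent})


# The default opener announces itself as "Python-urllib/<major>.<minor>",
# which is what a server-side filter can match on.
default_headers = build_opener().addheaders
print(default_headers)

# urllib stores header names capitalized, so the key is "User-agent".
req = build_request("http://www.cmegroup.com/trading/products/")
print(req.get_header("User-agent"))
```

Passing `req` to `urlopen` then sends the overridden header instead of the default one. Whether that is enough depends on the site, as the comments below point out.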

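Maybe a bit late, but I already have a user agent in my code and it still gives me "Error 404: Access denied" (2 upvotes)
Unfortunately, this doesn't work for some sites. There is a `requests` solution, though: /sf/ask/3156046841/. (2 upvotes)
Some sites also block "Mozilla/5.0". You may want to try "Mozilla/6.0" or other headers. (2 upvotes)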

Answer 2

zet*_*eta 33

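Since you are using urllib, it is almost certainly being blocked based on the user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener that overrides the user agent with Mozilla.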

import urllib.request

# Note: FancyURLopener has been deprecated since Python 3.3.
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')

Source

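The top answer didn't work for me, but yours did. Thanks a lot! (3 upvotes)
It does seem to open, but then it says "ValueError: read of closed file" (2 upvotes)
This works, but in Python 3.7 it produces the warning `DeprecationWarning: AppURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods` (2 upvotes)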
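For readers who hit the deprecation warning mentioned above, one possible non-deprecated equivalent is an opener built with `urllib.request.build_opener`, whose `addheaders` list is applied to every request it opens. This is a sketch rather than the answerer's method, and `make_opener` is a hypothetical helper name.

```python
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Build an opener that sends `user_agent` on every request.

    `make_opener` is an illustrative name; build_opener() and the
    `addheaders` attribute are the standard urllib.request API.
    """
    opener = urllib.request.build_opener()
    # Replace the default ("User-agent", "Python-urllib/x.y") entry.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# Network call, left commented out so the sketch runs offline:
# response = opener.open("http://httpbin.org/user-agent")
# print(response.read().decode("utf-8"))
```

Headers set explicitly on an individual `Request` take precedence over `addheaders`, so the two approaches can be combined.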

Answer 3

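roy*_*rek 8

"This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easily detected)", as already mentioned by Stefano Sanfilippo.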

from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

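web_byte is a bytes object returned by the server, and the content of most web pages is encoded as UTF-8. You therefore need to decode web_byte with the decode method.

This solved the whole problem for me when I was trying to scrape a website from PyCharm.

P.S. I use Python 3.4.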
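Since not every page is UTF-8, a slightly more defensive variant reads the charset from the Content-Type header and falls back to UTF-8 only when none is declared. The sketch below works on plain values so it runs offline; the function names `charset_from_content_type` and `decode_body` are hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, else `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(web_byte, content_type=None):
    """Decode response bytes using the declared charset (UTF-8 if absent).

    Undecodable bytes are replaced rather than raising; an unknown
    charset name would still raise LookupError.
    """
    charset = charset_from_content_type(content_type)
    return web_byte.decode(charset, errors="replace")


text = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
```

With a live response, the header value is available as `response.headers.get("Content-Type")`, and `response.headers.get_content_charset()` returns the declared charset directly.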

Answer 4

小智 6

Building on the previous answers, this worked for me on Python 3.7 after increasing the timeout to 10 seconds.

from urllib.request import Request, urlopen

req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
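When experimenting with headers and timeouts, it can be useful to catch the error rather than let the traceback end the script. The following is a sketch under assumptions: `fetch` and its `opener` parameter are hypothetical names, and the parameter exists only so the function can be exercised without a network connection.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch(url, timeout=10, opener=urlopen):
    """Return (status, body); status is None if no HTTP response arrived.

    `fetch` and `opener` are illustrative names, not part of urllib.
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(req, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        return err.code, b""
    except OSError:
        # Covers URLError (DNS failure, refused connection) and socket timeouts.
        return None, b""


# Usage (network call, left commented out):
# status, body = fetch("Url_Link")
# print(status, len(body))
```

A returned status of 403 means the request reached the server and was refused, which points at headers or cookies rather than at the timeout value.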

Answer 5

小智 5

Adding a cookie to the request headers worked for me.

from urllib.request import Request, urlopen

# Function to get the page content
def get_page_content(url, head):
  """
  Function to get the page content
  """
  req = Request(url, headers=head)
  return urlopen(req)

url = 'https://example.com'
head = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive',
  'Referer': 'https://example.com',
  'cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
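Copying a cookie value by hand works until the cookie expires. As an alternative sketch (not the answerer's method), the standard library can store cookies the server sets and resend them on later requests through `http.cookiejar.CookieJar` and `HTTPCookieProcessor`. The helper name `make_cookie_opener` is hypothetical.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_cookie_opener(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener keeps cookies across requests.

    `make_cookie_opener` is an illustrative name, not a urllib function.
    """
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_cookie_opener()
# Network calls, left commented out so the sketch runs offline:
# opener.open("https://example.com")          # server may set cookies here
# data = opener.open("https://example.com").read()  # cookies are sent back
```

This only helps when the site issues its cookies in ordinary HTTP responses; cookies set by JavaScript in a browser would still need to be supplied manually as in the answer above.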

Archived:	12 years, 8 months ago
Viewed:	114947 times
Last active:	7 years, 2 months ago