Python Mechanize HTTP Error 403: request disallowed by robots.txt

Question

Tags: python, django, robots.txt, mechanize, beautifulsoup

I built a Django site that scrapes news pages from the web to extract articles. Even though I am using mechanize, the site still responds with:

HTTP Error 403: request disallowed by robots.txt

I have tried everything. Here is my code (just the scraping part):

br = mechanize.Browser()
page = br.open(web)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# BeautifulSoup
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)

I also tried calling br.open before set_handle_robots(False) and the other settings. That did not work either.

Is there any way to get through to this site?

Answer 1

Cry*_*pto 5

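You are calling br.set_handle_robots(False) after br.open(), so the request has already been made (and rejected) by the time the setting is changed.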

It should be:

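import mechanize
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4; the original does not show its imports

br = mechanize.Browser()
# Configure the browser before making any request.
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.set_handle_equiv(False)   # do not treat HTML http-equiv headers as HTTP headers
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(web)  # web is the target URL, defined elsewhere in the asker's code
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)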
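
To see why the order matters without touching the network, here is a toy sketch that uses only the standard library. ToyBrowser, ROBOTS_TXT, the two helper functions and the example.com URLs are all made up for illustration. This is not mechanize's actual implementation; it only assumes, as the error message suggests, that the robots.txt check runs at the moment open() is called.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news/
"""


class ToyBrowser:
    """Toy stand-in for a robots-aware client (not mechanize itself)."""

    def __init__(self):
        self.handle_robots = True  # robots.txt is honoured by default
        self._rules = RobotFileParser()
        self._rules.parse(ROBOTS_TXT.splitlines())

    def set_handle_robots(self, flag):
        self.handle_robots = flag

    def open(self, url, user_agent="toy-agent"):
        # The flag is read at request time, so it has to be set before open().
        if self.handle_robots and not self._rules.can_fetch(user_agent, url):
            raise PermissionError("HTTP Error 403: request disallowed by robots.txt")
        return "fetched " + url


def open_then_configure(url):
    """Mirrors the order in the question: open first, configure afterwards."""
    br = ToyBrowser()
    page = br.open(url)          # raises for a disallowed URL
    br.set_handle_robots(False)  # too late to affect the request above
    return page


def configure_then_open(url):
    """Mirrors the order in the answer: configure first, then open."""
    br = ToyBrowser()
    br.set_handle_robots(False)
    return br.open(url)
```

With a URL under /news/, open_then_configure raises the 403-style error while configure_then_open returns normally, which is the same difference the reordered mechanize code relies on.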
