Handling error pages in Scrapy

Cry*_*pto 4 python web-crawler scrapy

I have a URL in start_urls.

The first time the crawler loads the page it is shown a 403 error page, and the crawler then shuts down.

What I need to do is fill in a captcha on that page, which then lets me access it. I know how to write the code to get past the captcha, but where do I put that code in my spider class?

I will also need to add it on other pages when the same problem comes up.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://mydomain.com/categories"]
    handle_httpstatus_list = [403] #Where do I now add the captcha bypass code?
    download_delay = 5
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        pass

Ble*_*der 6

Set handle_httpstatus_list so that the 403 response code is treated as a successful response:

class MySpider(CrawlSpider):
    handle_httpstatus_list = [403]
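
As a class attribute, handle_httpstatus_list applies to every request the spider makes. If you would rather allow the 403 only for specific requests, recent Scrapy versions also honour a handle_httpstatus_list key in Request.meta; a small sketch, reusing the URL from the question and overriding start_requests purely for illustration:

from scrapy.http import Request

def start_requests(self):
    # Allow a 403 response for this particular request only
    yield Request("http://mydomain.com/categories",
                  meta={'handle_httpstatus_list': [403]})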

As for getting past the actual captcha, you need to override parse to handle every page that comes back with a 403 response code differently:

def parse(self, response):
    if response.status == 403:
        # This is the captcha page; handle it separately
        yield self.handle_captcha(response)
        return

    # Otherwise fall back to CrawlSpider's normal parsing and link following
    for request_or_item in CrawlSpider.parse(self, response):
        yield request_or_item

def handle_captcha(self, response):
    # Fill in the captcha and send a new request
    # (Request comes from scrapy.http)
    return Request(...)
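
The handle_captcha stub above is left for you to fill in. A minimal sketch of one possible shape, assuming the 403 page contains the captcha form, that the form field is named 'captcha', and that solve_captcha() is your own helper that produces the answer (all of these are assumptions about your target site, not part of the original answer):

from scrapy.http import FormRequest

def handle_captcha(self, response):
    # solve_captcha() is a hypothetical helper (OCR, manual entry, an external service, ...)
    answer = self.solve_captcha(response)
    # Submit the captcha form found on the 403 page; the field name is an assumption
    return FormRequest.from_response(
        response,
        formdata={'captcha': answer},
        callback=self.after_captcha,
        dont_filter=True,  # the same URL was already requested, so skip the dupe filter
    )

def after_captcha(self, response):
    # With the captcha cleared, hand the page back to CrawlSpider's normal parse
    return CrawlSpider.parse(self, response)

If the site remembers the captcha via a session cookie, Scrapy's default cookie middleware will carry it along on the requests that follow.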