Scrapy: collecting "gave up retrying" messages


The spider has a maximum number of retries, as described here. Once that is reached, I get an error similar to the following:

Gave up retrying <GET https://foo/bar/123> (failed 3 times)

I believe the message is produced by the code here.

However, I would like to do something when a request is given up on. Specifically, I want to know whether it is possible to:

  1. Extract the 123 part of the URL (the ID) and write those IDs to a separate file.
  2. Access the meta information of the original request. This documentation might help.
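Point 1 can be sketched independently of Scrapy: pull the trailing path segment out of the failed URL and append it to a file. A minimal sketch, assuming the ID is always the last path segment (the `failed_ids.txt` filename is made up for illustration):

```python
from urllib.parse import urlparse

def extract_id(url):
    """Return the last path segment of a URL, e.g. '123' from https://foo/bar/123."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

def record_failed_id(url, path="failed_ids.txt"):
    """Append the extracted ID to a separate file, one ID per line."""
    with open(path, "a") as f:
        f.write(extract_id(url) + "\n")
```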


You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() so that it does whatever you want with the request.

from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy import log

class CustomRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # do something with the request: inspect request.meta, look at request.url...
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
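The "do something" comment in the else branch above is where both of the question's goals can be met: each retry is a copy() of the previous request, so the original request's meta survives to that point. A minimal sketch of a helper that branch might call (the `item_id` meta key and `failed_ids.txt` filename are assumptions, not part of the original answer):

```python
def on_gave_up(request, path="failed_ids.txt"):
    """Record a permanently failed request; meant to be called from the else branch."""
    # meta set when the request was first created survives every retry copy,
    # so a stashed ID can be read back here; fall back to parsing the URL.
    item_id = request.meta.get("item_id") or request.url.rstrip("/").split("/")[-1]
    with open(path, "a") as f:
        f.write(item_id + "\n")
```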

Then it is just a matter of referencing your custom middleware component in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 500,
}
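Relatedly, the retry ceiling that triggers the "Gave up retrying" message is itself configurable through Scrapy's RETRY_TIMES and RETRY_HTTP_CODES settings (the values below are illustrative, not the defaults):

```python
# settings.py (fragment)
RETRY_TIMES = 3                                # retries per request before giving up
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]   # response codes that trigger a retry
```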