Can't send requests the right way after replacing redirected URL with the original one using middleware

MIT*_*THU -4 python middleware scrapy web-scraping python-3.x

I've created a script using scrapy to grab some fields from a webpage. The URLs of the landing pages and of the inner pages often get redirected, so I created a middleware to handle the redirects. However, when I came across this post, I understood that I need to return the request from process_request() after replacing the redirected URL with the original one.

The requests sent from the spider always carry meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]}.
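For reference, the requests from the spider look roughly like this (a minimal sketch; the spider name, URL, and callback body are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder spider

    def start_requests(self):
        # keep Scrapy from following redirects itself and let the listed
        # statuses reach the middlewares and callbacks instead
        yield scrapy.Request(
            'https://example.com/landing',  # placeholder URL
            callback=self.parse,
            meta={
                'dont_redirect': True,
                'handle_httpstatus_list': [301, 302, 307, 429],
            },
        )

    def parse(self, response):
        ...  # extract fields here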

As the requests are not being redirected, I tried to replace the redirected URLs within the _retry() method.

# middlewares.py
from fake_useragent import UserAgent  # assumption: self.ua is a fake_useragent UserAgent

class CustomMiddleware:  # class wrapper assumed; the original snippet showed only the methods

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # rotate the User-Agent on every outgoing request
        request.headers['User-Agent'] = self.ua.random

    def process_exception(self, request, exception, spider):
        return self._retry(request, spider)

    def _retry(self, request, spider):
        request.dont_filter = True
        if request.meta.get('redirect_urls'):
            # redirect_urls[0] is the original, pre-redirect URL
            redirect_url = request.meta['redirect_urls'][0]
            redirected = request.replace(url=redirect_url)
            redirected.dont_filter = True
            return redirected
        return request

    def process_response(self, request, response, spider):
        if response.status in [301, 302, 307, 429]:
            return self._retry(request, spider)
        return response

Question: How can I send the requests the right way after replacing the redirected URLs with the original ones using middleware?

Iva*_*nel 5

Edit:

I'm putting this at the beginning of the answer because it's a quicker, one-shot solution that might work for you.

Scrapy 2.5 introduced get_retry_request, which allows you to retry a request from a spider callback.

From the docs:

Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like:

from scrapy.downloadermiddlewares.retry import get_retry_request  # available since Scrapy 2.5

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...

Then again, you should make sure to only retry on status codes starting with 3 when the website throws them to indicate something non-permanent (a redirect to a maintenance page, for example). As for status 429, see my suggestion below about using a delay.

Edit 2:

On Twisted versions older than 21.7.0, the async_sleep coroutine implementation that uses deferLater may not work. Use this one instead:

from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    # schedule the Deferred to fire after `delay` seconds, then await it
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
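Usage is the same either way: you await async_sleep(delay) from an async middleware method, exactly as in the TooManyRequestsRetryMiddleware below.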

Original answer:

If I understood you correctly, you just want to retry the original request whenever a redirect happens, right?

In that case, you can force a retry of requests that would otherwise get redirected by using a custom RedirectMiddleware:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority (higher number) than RetryMiddleware
    (or whatever the downloader middleware responsible for retrying on status 503 is called).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 already is in scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem

        return super().process_response(request, response, spider)

However, retrying every time one of these status codes appears may lead to other problems, so you might want to add some additional condition to that if, like checking for the presence of some header that could indicate site maintenance or something similar, as sketched below.
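As a rough sketch of that idea (the Retry-After check here is an assumption; inspect the headers your target site actually sends on temporary redirects):

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class ConditionalRedirectRetryMiddleware(RedirectMiddleware):
    """
    Only turns a redirect into a retryable 503 when the response carries
    some hint that the condition is temporary.
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):
            # assumed signal of a temporary condition (e.g. maintenance);
            # replace with whatever header/value your site really uses
            if response.headers.get('Retry-After') is not None:
                return response.replace(status=503)
        # otherwise, handle the redirect normally
        return super().process_response(request, response, spider)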

While we're at it: since you included status code 429 in your list, I assume you may be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying in that particular case. That can be achieved with a RetryMiddleware like this one:

# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """

        if request.meta.get('dont_retry', False):
            return response

        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')

                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return response

Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600
}
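A note on the priority numbers: the process_response hooks of downloader middlewares run from the highest number down to the lowest, so with 600 > 550 the CustomRedirectMiddleware rewrites the redirect status to 503 before TooManyRequestsRetryMiddleware sees the response and schedules the delayed retry.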