I've created a script using scrapy to fetch some fields from a webpage. The URLs of the landing page and the inner pages often get redirected, so I created a middleware to handle that redirection. However, when I came across this post, I understood that I need to return request in process_request() after replacing the redirected URL with the original one.
meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]} is always present when a request is sent from the spider.
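For reference, a minimal sketch of how such a request leaves the spider (the spider name and URL are placeholders, not from the original post):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder name

    def start_requests(self):
        # Suppress RedirectMiddleware and let 3xx/429 responses
        # reach the callbacks/middlewares instead of being followed.
        yield scrapy.Request(
            'https://www.example.com/landing-page',  # placeholder URL
            callback=self.parse,
            meta={'dont_redirect': True,
                  'handle_httpstatus_list': [301, 302, 307, 429]},
        )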
Since none of the requests are being redirected, I tried to replace the redirected URLs within the _retry() method:
# middlewares.py
from fake_useragent import UserAgent  # assumed: self.ua comes from fake_useragent

class RedirectHandlingMiddleware:  # class declaration restored; the name is a placeholder

    ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random

    def process_exception(self, request, exception, spider):
        return self._retry(request, spider)

    def _retry(self, request, spider):
        request.dont_filter = True
        if request.meta.get('redirect_urls'):
            # The first entry in redirect_urls is the original URL
            redirect_url = request.meta['redirect_urls'][0]
            redirected = request.replace(url=redirect_url)
            redirected.dont_filter = True
            return redirected
        return request

    def process_response(self, request, response, spider):
        if response.status in [301, 302, 307, 429]:
            return self._retry(request, spider)
        return response
Question: How can I send requests after replacing the redirected URLs with the original ones using a middleware?
EDIT:
I'm putting this at the beginning of the answer because it's a quicker one-shot solution that might work for you.
Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.
From the docs:

    Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like this:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...
But then again, you should make sure to only retry the 3xx status codes when the website throws them to indicate some non-permanent event, such as a redirect to a maintenance page (see the sketch below). As for status 429, see my advice below about using a delay.
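For example, you could gate the retry on some evidence that the redirect is temporary. A minimal sketch, assuming (purely hypothetically) that the maintenance page's URL contains 'maintenance'; inspect your target site's actual redirect targets to pick a real signal:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    # 'maintenance' in the Location header is an assumed marker, not
    # something the site is guaranteed to send.
    location = response.headers.get(b'Location', b'')
    if response.status in [301, 302, 307] and b'maintenance' in location:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='redirected to maintenance page',
        )
        if new_request_or_none:
            yield new_request_or_none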
EDIT 2:
On Twisted versions older than 21.7.0, the coroutine-based async_sleep implementation below, which uses deferLater, may not work. Use this instead:
from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
Original answer:
If I understood you correctly, you just want to retry the original request whenever a redirect occurs, right?
In that case, you can force a retry of requests that would otherwise be redirected by using a custom RedirectMiddleware like this one:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower
    priority (higher number) than RetryMiddleware (or whatever the downloader
    middleware responsible for retrying on status 503 is called).
    """
    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 is already in Scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem
        return super().process_response(request, response, spider)
However, retrying every time one of these status codes appears may lead to other problems, so you might want to add some additional conditions to that if, such as checking for the presence of certain headers that could indicate site maintenance or something similar; a sketch of such a check follows.
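As one possible shape for that condition, here is a minimal sketch; both signals it checks (a 'maintenance' marker in the Location header and the presence of a Retry-After header) are assumptions, not something the site is guaranteed to send:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class SelectiveRedirectMiddleware(RedirectMiddleware):
    """
    Like CustomRedirectMiddleware above, but only converts a redirect into
    a retryable 503 when the response hints at a temporary condition.
    """
    def process_response(self, request, response, spider):
        is_redirect = response.status in (301, 302, 303, 307, 308)
        # Hypothetical signals of a temporary outage; adapt to your site.
        looks_temporary = (
            b'maintenance' in response.headers.get(b'Location', b'')
            or b'Retry-After' in response.headers
        )
        if is_redirect and looks_temporary:
            return response.replace(status=503)
        return super().process_response(request, response, spider)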
While we're at it: since you included status code 429 in your list, I assume you may be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying in that specific case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """
    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60      # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600,
}