python web-crawler scrapy
I'm using Scrapy to crawl a product website with over 4 million products. After scraping around 50k products, however, it starts throwing HTTP 500 errors. I've set AutoThrottle to false, because with it enabled the crawl is very slow and would take around 20-25 days to finish. I think the server starts temporarily blocking the crawler after a while. What can be done about this? I'm using the sitemap crawler, and if the server doesn't respond I'd like to extract some information from the URL itself and move on to the next URL, rather than finishing the crawl and closing the spider. For that I've been looking at the errback parameter of Request. However, since I'm using the sitemap crawler, I don't create Request objects explicitly. Is there a default errback function I can override, or somewhere I can define one?
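For context, the throttling knobs in question live in settings.py. A minimal sketch with illustrative values, assuming Scrapy's built-in AutoThrottle extension and retry middleware:

AUTOTHROTTLE_ENABLED = False             # disabled here because the crawl was too slow with it on
DOWNLOAD_DELAY = 0.25                    # a fixed, modest delay as a middle ground
CONCURRENT_REQUESTS_PER_DOMAIN = 8
RETRY_HTTP_CODES = [500, 502, 503, 504]  # retry server errors a few times before giving up
RETRY_TIMES = 3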
HTTP 500 typically indicates an internal server error. When you're being blocked, you're much more likely to see a 403 or 404 (or perhaps a 302 redirect to a "you've been blocked" page). You're probably visiting links that cause something to break server-side. You should store which request caused the error and try visiting it yourself. It could be the case that the site is simply broken.
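If you do want to record the offending URLs for manual inspection, here is a minimal sketch (the spider and its names are illustrative; handle_httpstatus_list is Scrapy's standard way of letting non-2xx responses through to the callback):

import scrapy

class DebugSpider(scrapy.Spider):
    # Illustrative spider that records which URLs come back with a 500.
    name = 'debug500'
    handle_httpstatus_list = [500]  # let 500 responses reach parse()

    def parse(self, response):
        if response.status == 500:
            # Log the offending URL so it can be inspected by hand.
            self.log('HTTP 500 at %s' % response.url)
            return
        # ... normal parsing would go here ...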
OK, I get it. But can you tell me where and how to define the errback function, so that I can handle this error and my spider doesn't finish?
I took a look at SitemapSpider and unfortunately, it does not allow you to specify an errback function, so you're going to have to add support for it yourself. I'm basing this on the source for SitemapSpider.
First, you'll want to change the way sitemap_rules works by adding a function that handles errors:
sitemap_rules = [
    ('/product/', 'parse_product'),
    ('/category/', 'parse_category'),
]
becomes:
sitemap_rules = [
    ('/product/', 'parse_product', 'error_handler'),
    ('/category/', 'parse_category', 'error_handler'),
]
Next, in __init__, you'll want to store the new callbacks in _cbs:
for r, c in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    self._cbs.append((regex(r), c))
becomes:
for r, c, e in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    if isinstance(e, basestring):
        e = getattr(self, e)
    self._cbs.append((regex(r), c, e))
Finally, at the end of _parse_sitemap, you can specify your new errback function:
elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break
becomes:
elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c, e in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c, errback=e)
                break
From there, simply implement your errback function (keep in mind that it takes a Twisted Failure as its argument) and you should be good to go.
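For example, a minimal sketch of such an errback (the name matches the sitemap_rules above; the salvage logic is illustrative, not part of the original answer):

def error_handler(self, failure):
    # `failure` is a twisted.python.failure.Failure. When the error comes from
    # a filtered response (such as the 500s here), the wrapped HttpError still
    # carries the Response object.
    response = getattr(failure.value, 'response', None)
    url = response.url if response is not None else '<unknown>'
    self.log('Request failed: %s (%s)' % (url, failure.value))
    # Illustrative fallback: salvage what the URL itself tells us (here, the
    # last path segment) and let the crawl continue with the next URL.
    return [{'url': url, 'slug': url.rstrip('/').rsplit('/', 1)[-1]}]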