Scrapy download errors and remove_request error

vrl*_*oss 10 python twisted scrapy

Author's note: You may feel this post lacks context or information; that's only because I didn't know where to start. I'm happy to edit in additional information on request.


Running Scrapy, I see the following error on some of the links:

ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
    self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
    d.addBoth(lambda _: slot.remove_request(request))
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
    self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>

When I run the Scrapy shell on that single URL with the following command:

scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html

no error occurs. I'm scraping thousands of similar links without problems, but I see this issue on ~10 of them. I'm using Scrapy's default download timeout of 180 seconds, and I don't see any problem with these links in a web browser either.
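
For reference, that 180-second default comes from Scrapy's DOWNLOAD_TIMEOUT setting; it can be raised globally in settings.py or per request through the download_timeout meta key. A minimal sketch, reusing the url_nrd and item names from the request shown further below:

request = Request(
    url_nrd,
    meta={'item': item, 'download_timeout': 300},  # seconds; overrides the 180s default
    callback=self.parse_player,
    dont_filter=True,
)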

Parsing is initiated by this request:

request = Request(url_nrd, meta={'item': item},
                  callback=self.parse_player, dont_filter=True)

and handled in these functions:

def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        # doing stuff there
        request = Request(url_new, meta={'item': item},
                          callback=self.parse_more, dont_filter=True)
        yield request

def parse_more(self, response):
    # parsing more stuff here
    return item

Also: I haven't changed Scrapy's default settings for download retries (but I don't see any retries in my log files either).
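
For completeness, those retry defaults live in Scrapy's retry middleware settings. A sketch of the relevant settings.py entries, with the documented default values spelled out:

# settings.py -- retry middleware defaults, shown explicitly
RETRY_ENABLED = True
RETRY_TIMES = 2  # retries per request, on top of the first attempt
# RETRY_HTTP_CODES can be extended if the failures map to specific HTTP statuses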

Additional note: after my crawl finished, and because of dont_filter=True, I could see that links which at some point failed to download with the error above did not fail when they were requested earlier or later on.

Possible answer: I can see that I get the KeyError on one of the slots and that de-allocating the request from that slot fails (remove_request). Could it be because I set dont_filter=True and perform several requests on the same URL, and the slot's key appears to be that URL? Could the request have been de-allocated by a previous concurrent request on the same URL?

In that case, how can I create a unique key per request that isn't indexed on the URL?
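
One detail worth noting: the KeyError message prints <GET ...> only because that is Request's repr. As of Scrapy 1.x, the engine and downloader slots track Request objects hashed by identity, while it is the dupe filter, which dont_filter=True bypasses, that keys on a URL-based fingerprint. A small sketch against the Scrapy 1.x API to illustrate the difference:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

url = 'http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html'
r1 = Request(url)
r2 = Request(url)

# The dupe filter sees these as the same request ...
print(request_fingerprint(r1) == request_fingerprint(r2))  # True
# ... but the slots track object identity, so yielding the *same*
# instance twice schedules it twice and removes it twice (KeyError).
print(r1 is r2)  # False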


EDIT

I think the problem was in my parse_player code. I'm not sure, because I've edited my code since, but I remember seeing a bad indent on yield request.

def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if some_condition:
                # doing stuff there
                request = Request(url_new, meta={'item': item},
                                  callback=self.parse_more, dont_filter=True)
            # Bad indent of yield request here!
            yield request

Please let me know if you think that might have caused the issue.

Eli*_*iro 5

What if you just ignore the errors?

def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if some_condition:
                # doing stuff there
                request = Request(url_new, meta={'item': item},
                                  callback=self.parse_more, dont_filter=True)
                # yield inside the condition, so the same request object
                # is never re-yielded on a later iteration
                yield request
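
A possible refinement of that idea, not part of the answer above: attach an errback to each Request so download failures are logged and dropped explicitly instead of only filtering on status. The on_download_error name is made up for illustration; errback and spider.logger are standard Scrapy 1.x features:

def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if some_condition:
                yield Request(url_new, meta={'item': item},
                              callback=self.parse_more,
                              errback=self.on_download_error,
                              dont_filter=True)

def on_download_error(self, failure):
    # failure is a twisted.python.failure.Failure describing the error
    self.logger.error('Download failed: %s', repr(failure))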