Tags: python, twisted, scrapy
Author's note: you may feel this post lacks context or information; that's only because I don't know where to start. I'm happy to edit in more details on request.
While running scrapy I see the following error on the affected links:
ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
d.addBoth(lambda _: slot.remove_request(request))
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
When I run scrapy shell on that single URL with the following command:
scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html
no error occurs. I am scraping many thousands of similar links without a problem, but I see this issue on ~10 links. I am using scrapy's default download timeout of 180 seconds, and I don't see anything wrong with these links in a web browser either.
Parsing is initiated by the request:
request = Request(url_nrd, meta={'item': item}, callback=self.parse_player, dont_filter=True)
and processed in the function:
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        # doing stuff there
        request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
        yield request

def parse_more(self, response):
    # parsing more stuff here
    return item
Also: I did not change scrapy's default settings for download retries (but I don't see any retries in my log files either).
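For reference, here is a minimal sketch of the retry/timeout defaults in play, written out explicitly in settings.py (these match Scrapy's documented defaults, so stating them changes nothing):

# settings.py -- explicit restatement of the defaults relevant here.
# Leaving these lines out has the same effect, since they match
# Scrapy's documented defaults.
DOWNLOAD_TIMEOUT = 180  # seconds before a download attempt is abandoned
RETRY_ENABLED = True    # RetryMiddleware retries failed downloads...
RETRY_TIMES = 2         # ...up to this many times after the first attempt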
Additional notes:
After my crawl completes (since dont_filter=True), I can see that links which at some point failed to download with the error above did not fail when requested earlier or later in the crawl.
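Here is a minimal sketch of one way to surface these per-request failures, using the errback hook that Scrapy's Request API provides (the spider name and the logging bodies are mine, purely for illustration):

from scrapy import Request, Spider


class PlayerSpider(Spider):
    # Hypothetical spider, just to show the errback wiring.
    name = 'players'
    start_urls = ['http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html']

    def start_requests(self):
        for url in self.start_urls:
            # errback fires whenever the download raises, so intermittent
            # failures become visible per request in the log.
            yield Request(url, callback=self.parse_player,
                          errback=self.on_error, dont_filter=True)

    def parse_player(self, response):
        self.logger.info('Downloaded %s (%d)', response.url, response.status)

    def on_error(self, failure):
        # failure is a twisted.python.failure.Failure instance
        self.logger.error('Request failed: %s', repr(failure))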
Possible answer:
I can see that I got a KeyError in one of my spiders and that deallocating the request for that spider failed (remove_request). Is it possible that this happens because I set dont_filter=True and performed several requests on the same URL, while the key for those requests seems to be the URL? Were the requests deallocated by an earlier concurrent request on the same URL?
In that case, how can I create a unique key per request that is not indexed on the URL?
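For what it's worth, a minimal sketch of one workaround in that direction, assuming the target server ignores unknown query parameters (the _nonce name and the helper itself are mine, not Scrapy API):

import uuid

from scrapy import Request


def unique_request(url, item, callback):
    # Hypothetical helper: append a throwaway query parameter so every
    # request URL (and therefore any URL-derived key or fingerprint)
    # is distinct, even when the same page is fetched several times.
    sep = '&' if '?' in url else '?'
    return Request('%s%s_nonce=%s' % (url, sep, uuid.uuid4().hex),
                   meta={'item': item}, callback=callback, dont_filter=True)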
EDIT:
I think the problem was in my parse_player code. I'm not sure, because I have edited my code since, but I remember seeing a bad indent on a yield request there.
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
        # Bad indent of yield request here!
        yield request
Let me know if you think that could have caused the problem.
What if you just ignore the errors?
def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
                # yield is now correctly indented, inside the loop
                yield request