Muc*_*ing 1 python scrapy web-scraping
I have a multi-level Scrapy spider that works locally, but on Scrapy Cloud it returns GeneratorExit on every request.
Here are the parse methods:
def parse(self, response):
    results = list(response.css(".list-group li a::attr(href)"))
    for c in results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        yield response.follow(c,
                              callback=self.parse_category,
                              meta=meta,
                              errback=self.errback_httpbin)

def parse_category(self, response):
    category_results = list(response.css(
        ".item a.link-unstyled::attr(href)"))
    category = response.css(".active [itemprop='title']")
    for r in category_results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        meta["category"] = category
        yield response.follow(r, callback=self.parse_item,
                              meta=meta,
                              errback=self.errback_httpbin)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))
Here is the traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
GeneratorExit
[stderr] Exception ignored in: <generator object iter_errback at 0x7fdea937a9e8>
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 671, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
    result = next(self._iterator)
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 57, in enqueue_request
    dqok = self._dqpush(request)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 86, in _dqpush
    self.dqs.push(reqd, -request.priority)
  File "/usr/local/lib/python3.6/site-packages/queuelib/pqueue.py", line 35, in push
    q.push(obj) # this may fail (eg. serialization error)
  File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 15, in push
    s = serialize(obj)
  File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 27, in _pickle_serialize
    return pickle.dumps(obj, protocol=2)
builtins.TypeError: can't pickle HtmlElement objects
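I set an errback, but it does not provide any error details. I also copy meta into every request, but that makes no difference. Am I missing something?
Update: The error seems to be specific to multi-level spiders. For now, I have rewritten the spider with a single parse method.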
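One difference between jobs run locally and on Scrapy Cloud is that Scrapy Cloud enables the JOBDIR setting, which makes Scrapy serialize requests to a disk queue instead of an in-memory one.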
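To exercise the disk-queue path on your own machine, one option is to enable a job directory locally as well. A minimal sketch, assuming a standard project settings.py; the directory name is an arbitrary example, not something from the question:

```python
# settings.py fragment. The directory name is a made-up example.
# With JOBDIR set, Scrapy keeps pending requests in on-disk queues,
# so each scheduled request has to be serialized before it is stored.
JOBDIR = "crawls/debug-run"
```

The same setting can be passed for a single run with -s JOBDIR=crawls/debug-run on the scrapy crawl command line. Depending on the Scrapy version, an unpicklable request may either raise as in the traceback above or be kept in memory instead.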
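When serializing to disk, the pickle operation fails because your request.meta dict contains a SelectorList object (assigned in the line category = response.css(".active [itemprop='title']")). Selectors hold instances of lxml.html.HtmlElement, which cannot be pickled (an issue outside Scrapy's scope), hence the TypeError: can't pickle HtmlElement objects.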
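A straightforward way around this is to store plain data in meta rather than the selector itself. The sketch below assumes the category name is the text of the matched element, so it appends ::text to the selector from the question; the helper name category_meta is hypothetical:

```python
def category_meta(response):
    """Return a picklable copy of response.meta with the category added."""
    # dict(...) makes a shallow copy, replacing the key-by-key loop.
    meta = dict(response.meta)
    # .get() returns the first match as a plain str (or None if nothing
    # matches), so no SelectorList or HtmlElement ends up in meta.
    # Assumption: the category name is the element's text content.
    meta["category"] = response.css(".active [itemprop='title']::text").get()
    return meta
```

In parse_category, each response.follow(...) call would then receive meta=category_meta(response). On older Scrapy releases, where .get() is not available, .extract_first() does the same job.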
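There is a merged pull request that addresses this. It does not fix the pickle operation itself; instead, it tells the scheduler not to try to serialize these kinds of requests to disk and to keep them in memory.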
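The fallback idea can be illustrated with a toy queue. This is not Scrapy's scheduler code, only a sketch of the behaviour described above; the class FallbackQueue and its attributes are hypothetical, and a deque stands in for the on-disk queue:

```python
import pickle
from collections import deque


class FallbackQueue:
    """Toy queue: serialize to 'disk' when possible, else keep in memory."""

    def __init__(self):
        self.disk = deque()    # stand-in for a disk queue; holds pickled bytes
        self.memory = deque()  # holds live objects that could not be pickled

    def push(self, request):
        try:
            payload = pickle.dumps(request, protocol=2)
        except (pickle.PicklingError, TypeError, AttributeError):
            # Serialization failed: keep the request in memory instead of
            # letting the error propagate to the caller.
            self.memory.append(request)
            return "memory"
        self.disk.append(payload)
        return "disk"
```

The trade-off is that requests held in memory are lost if the job stops, so they cannot be resumed from the job directory the way disk-queued requests can.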