scrapy: call a function when the spider quits

Abe*_*Abe 41 python scrapy

Is there a way to trigger a method in a Spider class just before it terminates?

I can terminate the spider myself, like this:

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    # Config stuff goes here...

    def quit(self):
        # Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        # Parsing stuff goes here...

But I can't find any information on how to determine when the spider is about to quit on its own.

dm0*_*514 70

It looks like you can register a signal listener through the dispatcher.

I would try something like this:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed.
        pass

  • Works perfectly. But I'd suggest naming the method something like MySpider.quit() to avoid confusion with the signal name. Thanks! (4 upvotes)
  • In newer versions of scrapy, `scrapy.xlib.pydispatch` is deprecated. You can use `from pydispatch import dispatcher` instead (see the sketch below). (3 upvotes)
  • Doesn't work with v1.1 because xlib.pydispatch has been deprecated. Instead, they recommend using PyDispatcher. Can't get it to work yet, though... (2 upvotes)
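
As the comments above point out, `scrapy.xlib.pydispatch` was removed from newer Scrapy releases. A minimal, untested sketch of the same idea using the standalone PyDispatcher package (assuming Scrapy still routes its signals through it) might look like the following; the `from_crawler` approach shown in a later answer is the documented route:

from pydispatch import dispatcher  # standalone PyDispatcher package (pip install PyDispatcher)
from scrapy import signals
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # connect the handler to Scrapy's spider_closed signal
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # called with the spider instance that is about to close
        self.logger.info('Spider %s closed', spider.name)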

THI*_*ELP 35

Just an update: you can simply define a closed method, like this:

class MySpider(CrawlSpider):
    def closed(self, reason):
        do_something()  # placeholder for your cleanup logic

  • In my scrapy it's `def close(self, reason):`, not `closed`. (5 upvotes)
  • @AminahNuraini Scrapy 1.0.4 `def closed(reason)` (4 upvotes)

Lev*_*von 13

For Scrapy version 1.0.0+ (it may also work with older versions):

from scrapy import signals
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        print('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        print('Closing {} spider'.format(spider.name))

A nice use case is adding a tqdm progress bar to a scrapy spider.

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tqdm import tqdm

from myproject.items import MyItem  # adjust to your own project's items module


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['somedomain.comm']
    start_urls = ['http://www.somedomain.comm/ccid.php']

    rules = (
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccds.php\?id=.*'),
             callback='parse_item',
             ),
        Rule(LinkExtractor(allow=r'^http://www.somedomain.comm/ccid.php$',
                           restrict_xpaths='//table/tr[contains(., "SMTH")]'), follow=True),
    )

    def parse_item(self, response):
        self.pbar.update()  # update progress bar by 1
        item = MyItem()
        # parse response
        return item

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.pbar = tqdm()  # initialize progress bar
        self.pbar.clear()
        self.pbar.write('Opening {} spider'.format(spider.name))

    def spider_closed(self, spider):
        self.pbar.clear()
        self.pbar.write('Closing {} spider'.format(spider.name))
        self.pbar.close()  # close progress bar


Chr*_*ris 7

For me the accepted answer didn't work / was outdated, at least for scrapy 0.19. I got it working with the following:

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # do stuff here
        pass


小智 7

For the latest version (v1.7), just define a closed(reason) method in your spider class.

closed(reason)

Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

Scrapy documentation: scrapy.spiders.Spider.closed
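
For completeness, a minimal sketch of what that looks like in practice (the spider name and URL are placeholders); per the docs, reason is a string such as 'finished', 'cancelled', or 'shutdown':

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.logger.info('Parsed %s', response.url)

    def closed(self, reason):
        # reason is e.g. 'finished', 'cancelled' or 'shutdown'
        self.logger.info('Spider closed: %s', reason)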