ReactorNotRestartable error when running Scrapy in a while loop

k_w*_*wit 18 python twisted scrapy python-2.7

I get a twisted.internet.error.ReactorNotRestartable error when I run the following code:

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)

It works the first time, and then I get the error. I create the process variable on each iteration, so what is the problem?

pau*_*rth 6

By default, CrawlerProcess.start() stops the Twisted reactor it creates when all crawlers have finished.

You should call process.start(stop_after_crawl=False) if you create process in each iteration.

Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The documentation has an example of doing so.
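For reference, a minimal sketch of that CrawlerRunner pattern, modelled on the sequential-crawl example in the Scrapy docs. The spider name 'my_spider' comes from the question; the fixed run count of 3 is an assumption for illustration:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # Each yield waits for one crawl to finish without restarting the reactor.
    for _ in range(3):  # illustrative run count
        yield runner.crawl('my_spider')
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until reactor.stop() is called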

  • `process.start(stop_after_crawl=False)` - will block the main process (13 upvotes)

Gih*_*age 6

Within a given process, once you call reactor.run() or process.start(), you cannot run those commands again. The reason is that the reactor cannot be restarted. The reactor stops once the script finishes executing.

So if you need to run the reactor multiple times, the best option is to use separate subprocesses.

You can move the contents of the while loop into a function (e.g. execute_crawling) and run it in a separate subprocess each time, using Python's multiprocessing module. The code is as follows.

from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling():
    # runs in a fresh subprocess, which gets a fresh Twisted reactor
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result as defined in the question
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the subprocess terminates
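One caveat: a global like result set via set_result inside the child process is not visible to the parent. A minimal sketch, assuming the question's spider name 'my_spider', that passes the scraped item back to the parent through a multiprocessing.Queue:

from multiprocessing import Process, Queue

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling(queue):
    def set_result(item):
        queue.put(item)  # hand the item back to the parent process

    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=execute_crawling, args=(queue,))
    p.start()
    p.join()  # wait for the crawl to finish
    result = queue.get() if not queue.empty() else None  # None if nothing was scraped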


Sag*_*tha 5

I was able to solve this problem like this: process.start() should be called only once.

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    global result  # needed so the module-level result is actually updated
    result = item

process = CrawlerProcess(get_project_settings())
dispatcher.connect(set_result, signals.item_scraped)

# crawl() can be called repeatedly here to queue more runs;
# all queued crawls run concurrently once the reactor starts
process.crawl('my_spider')

process.start()  # called exactly once; blocks until all queued crawls finish


小智 5

Refer to http://crawl.blog/scrapy-loop/

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(result, *args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    # re-adding _crawl as a callback makes the crawl/sleep cycle repeat indefinitely
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)  # MySpider is your spider class, defined elsewhere in the project
process.start()
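Note that because _crawl keeps rescheduling itself as a callback, process.start() never returns on its own; to end the loop, drop the last addCallback or call reactor.stop() from a callback once you are done.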