I get a twisted.internet.error.ReactorNotRestartable error when I run the following code:
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)
It works the first time, but then I get the error. I create the process variable on every iteration, so what is the problem?
By default, CrawlerProcess's .start() stops the Twisted reactor it creates when all crawlers have finished. You should call process.start(stop_after_crawl=False) if you are creating process in each iteration.
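A minimal sketch of what that flag does (the spider name 'my_spider' is taken from the question); note that with stop_after_crawl=False, start() blocks until the reactor is stopped explicitly:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('my_spider')
# Keep the reactor alive after this crawl finishes, so later crawls can be
# scheduled from callbacks; start() now blocks until reactor.stop() is called.
process.start(stop_after_crawl=False)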
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The documentation has an example of doing exactly that.
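A minimal sketch along the lines of that docs example, chaining two runs of the question's 'my_spider' with CrawlerRunner; the reactor is started once and stopped only at the end:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # each yield waits for one crawl to finish before starting the next
    yield runner.crawl('my_spider')
    yield runner.crawl('my_spider')
    reactor.stop()  # stop the reactor only after the last run

crawl()
reactor.run()  # the script blocks here until the last crawl finishes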
For a given process, once reactor.run() or process.start() has been called, you cannot run those commands again: the reactor cannot be restarted, and it stops once the script finishes executing. So if you need to run the reactor multiple times, the best option is to use separate subprocesses.
You can move the contents of your while loop into a function (e.g. execute_crawling) and run it in a separate subprocess each time, using Python's multiprocessing Process module. The code is below.
from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result as defined in the question
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):  # placeholder for however many runs you need
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates
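One caveat: the child process has its own memory, so an item captured by set_result there is not visible to the parent. A sketch of handing the item back through a multiprocessing.Queue (the Queue handoff is an addition for illustration, not part of the answer above):

from multiprocessing import Process, Queue

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling(queue):
    # a local named function keeps a live reference, so the dispatcher's
    # weak reference to the handler stays valid while the crawl runs
    def set_result(item):
        queue.put(item)

    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=execute_crawling, args=(queue,))
    p.start()
    result = queue.get()  # blocks until the first scraped item arrives
    p.join()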
I was able to solve this problem like this: process.start() should be called only once.
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    break  # queue the crawl, then leave the loop before starting the reactor

process.start()  # called exactly once, after the crawl is queued
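The same single start() call can also run several crawls queued beforehand; a minimal sketch (the second spider name is hypothetical):

process = CrawlerProcess(get_project_settings())
process.crawl('my_spider')
process.crawl('my_other_spider')  # hypothetical second spider
process.start()  # one reactor start runs every queued crawl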
Per http://crawl.blog/scrapy-loop/, you can chain crawls as Twisted deferreds, so the reactor is started only once and never needs to restart:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(self, *args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(
        lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)  # chain the next run onto this one
    return deferred

_crawl(None, MySpider)  # MySpider is your spider class
process.start()