Scrapy - Reactor not Restartable

dat*_*den 14 python web-crawler scrapy

With:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start() 

But since I have moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2) 

and started calling the method via class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What is wrong?

Fer*_*ard 23

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
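
Since the question's web_crawler is meant to hand results back, a natural extension of this answer is to collect the scraped items in the child process and ship them back through the same Queue. The sketch below is not part of the original answer: the item_scraped signal is real Scrapy API, but the names _crawl_collect and collect_items are made up here, and the spider has to yield its items (rather than just print them) for the signal to fire:

import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from scrapy import signals
from twisted.internet import reactor

def _crawl_collect(q, spider):
    try:
        items = []
        runner = crawler.CrawlerRunner()
        c = runner.create_crawler(spider)
        # item_scraped fires once for every item the spider yields
        c.signals.connect(lambda item, response, spider: items.append(item),
                          signal=signals.item_scraped)
        deferred = runner.crawl(c)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()
        q.put(items)  # items must be picklable (dicts and Items are)
    except Exception as e:
        q.put(e)

def collect_items(spider):
    q = Queue()
    p = Process(target=_crawl_collect, args=(q, spider))
    p.start()
    result = q.get()
    p.join()
    if isinstance(result, Exception):
        raise result
    return result  # the list of scraped items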

  • This solution works. Tested it with Jupyter (Google Colab). [⚠️Note⚠️] There is one important caveat: you must restart the runtime the first time you use it. Otherwise the bloated corpses of your previous reactors are still around, and your forked processes will carry them along as well. After that everything runs smoothly, since the parent process never again touches its own reactor. (4 upvotes)
  • Had a small issue with "AttributeError: Can't pickle local object 'run_spider.&lt;locals&gt;.f'", but moving the function named "f" out to module level solved it for me and I could run the code; see the sketch after this list. (3 upvotes)
  • Getting an error when trying to run the code above: `AttributeError: Can't pickle local object 'run_spider.&lt;locals&gt;.f'` (2 upvotes)
  • I noticed that the same code runs smoothly when running Python in WSL, so this seems to be an issue with Python under Windows. (2 upvotes)
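
For reference, a minimal sketch of the fix those comments describe: move the worker function out of run_spider to module level, so it can be pickled by reference under the "spawn" start method that Windows uses (the name _crawl is made up; everything else mirrors the answer's code):

import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# module-level worker: picklable by reference, unlike the closure f above
def _crawl(q, spider):
    try:
        runner = crawler.CrawlerRunner()
        deferred = runner.crawl(spider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()
        q.put(None)
    except Exception as e:
        q.put(e)

def run_spider(spider):
    q = Queue()
    p = Process(target=_crawl, args=(q, spider))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result

# usage is unchanged: run_spider(QuotesSpider)
# (the spider class must also live at module level so it pickles too)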

Chi*_*fir 9

This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question:
0) pip install crochet
1) from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()

I had the same problem with this error, and spent 4+ hours solving it, reading all the questions about it here. Finally I found one - and I am sharing it. This is how I solved the problem. The only meaningful lines left from the Scrapy docs are the last two lines of my code below:

# some more imports
from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)    # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()          # get mySpider object from the spider module
    crawler = CrawlerRunner(get_project_settings())   # from Scrapy docs
    crawler.crawl(spiderObj)                          # from Scrapy docs

This code lets me choose which spider to run just by passing its name to the run_spider function, and once the scraping finishes, pick another spider and run it again.
Hope this helps somebody, as it helped me :)
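
As a usage note: with crochet, setup() runs the Twisted reactor in a background thread, so crawler.crawl(...) returns a Deferred immediately instead of blocking. A minimal sketch of waiting for each crawl to finish before starting the next one, using crochet's wait_for decorator (the spider names "spider_one"/"spider_two" and the first_scrapy layout are assumptions mirroring the code above):

from importlib import import_module

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # starts the reactor in a background thread; call it once per process

@wait_for(timeout=600.0)  # block the calling thread until the crawl's Deferred fires
def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)  # assumed project layout
    scrapy_var = import_module(module_name)
    crawler = CrawlerRunner(get_project_settings())
    return crawler.crawl(scrapy_var.mySpider)  # return the Deferred so wait_for can wait on it

# hypothetical spider names: one crawl after another, no ReactorNotRestartable
run_spider("spider_one")
run_spider("spider_two")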