I followed this guide, http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script, to run Scrapy from my script. Here is part of my script:
crawler = Crawler(Settings(settings))
crawler.configure()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
print "It can't be printed out!"
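It works as expected up to a point: it visits the pages, scrapes the required information, and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can tell from the item count in the output JSON), execution of my script does not resume: the print statement after reactor.run() is never reached. It is probably not a Scrapy problem, and the answer most likely lies in the Twisted reactor. How can I release the thread so that execution continues?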
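The blocking behaviour described above is general to event loops: the run call returns only once something stops the loop. The following is a minimal sketch of that pattern using the standard library's asyncio as a stand-in for the Twisted reactor; it is not Scrapy's or Twisted's API, and fake_crawl and on_done are hypothetical names. In Scrapy the analogous hook would likely be a handler for the spider_closed signal that calls reactor.stop(), though the exact wiring depends on the Scrapy version.

```python
import asyncio


def run_until_job_done():
    """Run an event loop that blocks until a completion callback stops it."""
    loop = asyncio.new_event_loop()
    events = []

    async def fake_crawl():
        # Hypothetical stand-in for the spider's work.
        await asyncio.sleep(0.01)
        events.append("crawl finished")

    def on_done(_task):
        # Without this call run_forever() would never return,
        # much like reactor.run() without a matching reactor.stop().
        events.append("stopping loop")
        loop.stop()

    task = loop.create_task(fake_crawl())
    task.add_done_callback(on_done)
    try:
        loop.run_forever()  # blocks here, comparable to reactor.run()
    finally:
        loop.close()

    # Reached only because on_done() stopped the loop.
    events.append("script resumed")
    return events


if __name__ == "__main__":
    print(run_until_job_done())
```

The point of the sketch is the completion callback: the code after the blocking call runs only because the callback stops the loop when the job is done.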
With:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
I have always been able to run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
But since I moved this code into a web_crawler(self) method, like this:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
and started calling the method through an instance of the class, like this:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I get the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module> …
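One detail worth noting in the __call__ snippet is that web_crawler() is invoked twice, so process.start() would run twice in the same Python process. A Twisted reactor cannot be restarted once it has stopped, so a second start typically fails. Since the traceback is truncated, this is only an assumption about the cause. The sketch below shows calling the method once and indexing the returned tuple; the Test class and its placeholder return values are hypothetical, and the real crawl is replaced by a counter so the example runs without Scrapy.

```python
class Test:
    """Hypothetical stand-in for the class in the question."""

    def __init__(self):
        self.crawl_runs = 0

    def web_crawler(self):
        # Placeholder for the real crawl: in the original code this is
        # where CrawlerProcess(...).start() would run and block.
        self.crawl_runs += 1
        return ("result1", "result2")

    def __call__(self):
        # Call web_crawler() once and reuse the returned tuple, so the
        # crawl (and the reactor behind it) is started only one time.
        results = self.web_crawler()
        results1 = results[1]
        results2 = results[0]
        return results1, results2


test = Test()
results1, results2 = test()
```

With this shape the crawl runs a single time per call, regardless of how many values are read from its result.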