Related troubleshooting questions (0)

Scrapy crawl in a script always blocks script execution after scraping

I followed this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from my script. Here is part of my script:

    # imports the script relies on (Scrapy 0.16, as in the linked guide)
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks here and never returns
    print "It can't be printed out!"

It should work: visit the pages, scrape the required information and store the output JSON where I told it to (via FEED_URI). And it does. But when the spider finishes its work (I can see that from the numbers in the output JSON), the execution of my script does not resume. It is probably not a Scrapy problem; the answer should be somewhere in the Twisted reactor. How can I release the thread execution?
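A minimal sketch of the usual remedy for this setup, assuming the crawler, spider and settings objects from the snippet above: connect a handler to Scrapy's spider_closed signal that stops the Twisted reactor, so that reactor.run() returns once the crawl has finished.

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.xlib.pydispatch import dispatcher

    def stop_reactor():
        # called when the spider closes; makes reactor.run() return
        reactor.stop()

    # register the handler before starting the reactor
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)

    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # returns after stop_reactor() fires
    print "Now this line is reached."

In newer Scrapy versions the same idea is expressed as crawler.signals.connect(reactor.stop, signal=signals.spider_closed); the dispatcher form matches the 0.16 API used in the question.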

python twisted scrapy

19 votes · 2 answers · 8383 views

Scrapy - Reactor not restartable

Given:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start() 

But since I moved this code into a web_crawler(self) method, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2) 

and started calling the method through class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module> …
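Twisted's reactor can only be started once per process, so the second test.web_crawler() call fails. A minimal sketch of one way around this, assuming the test object and web_crawler() method from the snippets above: call web_crawler() a single time and unpack both values from the returned tuple.

    def __call__(self):
        # run the crawl (and therefore the reactor) only once,
        # then unpack both values from the single (result1, result2) return
        result1, result2 = test.web_crawler()
        results1 = result2   # was: test.web_crawler()[1]
        results2 = result1   # was: test.web_crawler()[0]

If the crawl really must run more than once within the same program, a common workaround is to launch each CrawlerProcess in its own child process (for example via the multiprocessing module), since every child process gets a fresh reactor.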

python web-crawler scrapy

14 votes · 2 answers · 10k views

Tag statistics

python ×2

scrapy ×2

twisted ×1

web-crawler ×1