I am new to Scrapy and I am looking for a way to run it from a Python script. I found two sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the …

I am having trouble calling a Scrapy spider from a Django view. How can I do this? I tried to follow this tutorial, http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/, but it did not work when importing the settings.
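As a rough sketch of the overall layout (this is not the snippet above; the spider, its name, and the URL are all placeholders): the spider class can live in the same file as the script, and a CrawlerProcess started under the main guard runs it.

# standalone_spider.py -- minimal sketch; ExampleSpider and its URL are hypothetical
import scrapy
from scrapy.crawler import CrawlerProcess

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # yield plain dicts as scraped items
        yield {"title": response.css("title::text").extract_first()}

if __name__ == "__main__":
    process = CrawlerProcess()       # a settings dict can be passed here
    process.crawl(ExampleSpider)     # the spider defined above, in this same file
    process.start()                  # blocks until the crawl is finished

Inside a Django view the same call would block the request, and the Twisted reactor cannot be restarted within one process, which is why the linked tutorials resort to multiprocessing; a similar sketch appears further down.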
My scraper works fine when I run it from the command line, but when I try to run it from within a Python script (using the Twisted approach outlined here), it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files, one of them using CsvItemExporter() and the other using writeCsvFile(). Here is the code:
class CsvExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        # hook the pipeline up to the spider_opened / spider_closed signals
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # first CSV: exported item by item via CsvItemExporter
        nodes = open('%s_nodes.csv' % spider.name, 'w+b')
        self.files[spider] = nodes
        self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url', 'name', 'screenshot'])
        self.exporter1.start_exporting()
        # second CSV: edge rows collected in memory and written on close
        self.edges = []
        self.edges.append(['Source', 'Target', 'Type', 'ID', 'Label', 'Weight'])
        self.num = 1

    def spider_closed(self, spider):
        self.exporter1.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        writeCsvFile(getcwd() + r'\edges.csv', self.edges)

    def process_item(self, item, spider):
        self.exporter1.export_item(item)
        for url in item['links']:
            self.edges.append([item['url'], url, 'Directed', self.num, '', 1])
            self.num += 1
        return item
Here is my file structure:

SiteCrawler/ # the CSVs are normally created …
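One hedged guess, not something confirmed in the question: when the spider is started from a script without the project settings, ITEM_PIPELINES is never loaded, so CsvExportPipeline never runs and no CSVs appear. A minimal sketch of starting the crawl with the project settings applied (the spider name "sitecrawler" is a placeholder; the script is assumed to sit inside the SiteCrawler project so scrapy.cfg can be found):

# run_from_script.py -- sketch only
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()   # loads settings.py, including ITEM_PIPELINES
process = CrawlerProcess(settings)
process.crawl("sitecrawler")        # spider name; a spider class also works here
process.start()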
My Scrapy code looks like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = [
        'https://movie.douban.com/subject/25934014/',
        'https://movie.douban.com/subject/25852314/',
    ]

    def parse(self, response):
        title = response.css('div#wrapper div#content h1 span::text').extract_first()
        year = response.css('div#wrapper div#content h1 span.year::text').extract_first()
        yield {
            'url': response.url,
            'title': title,
            'year': year,
        }
I run it like this:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'movie.json',
    'FEED_EXPORT_ENCODING': 'utf-8'
})

process.crawl(MovieSpider)
process.start()  # the script will block here until the crawl is finished
This is the way recommended in the docs.

The problem is that after running the script above, it cannot be run again. Jupyter Notebook returns the error ReactorNotRestartable.

If I restart the kernel in Jupyter, it runs fine the first time.

I think the problem is the one pointed out in "Scrapy crawl from script always blocks script execution after scraping".

I could probably work around it by using the code from there, but that code is quite complex for such a small thing, and it is far from the CrawlerProcess approach recommended in the docs.

I would like to know if there is a better way to solve this? …
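One workaround, an assumption on my part rather than something from the question: run each crawl in its own child process, so that every run gets a fresh Twisted reactor. A rough sketch that reuses the MovieSpider class and settings shown above:

import multiprocessing

from scrapy.crawler import CrawlerProcess

def _crawl():
    # runs inside the child process, so the reactor starts only once per run
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'movie.json',
        'FEED_EXPORT_ENCODING': 'utf-8',
    })
    process.crawl(MovieSpider)
    process.start()

def run_movie_spider():
    # each call gets a new process, so ReactorNotRestartable never appears
    p = multiprocessing.Process(target=_crawl)
    p.start()
    p.join()

Depending on the multiprocessing start method (spawn vs. fork), the spider and _crawl may need to live in an importable module rather than in the notebook cell itself.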
Here is my code:
class Test(Spider):
    self.settings.overrides['JOBDIR'] = "seen"
I get:
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 46, in settings
return self.crawler.settings
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 41, in crawler
assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
AssertionError: Spider not bounded to any crawler
I am extending Spider, and I am not using Crawler because I have no links or rules to follow.

I guess my problem is that I am not importing the settings properly; I need your help.
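For what it is worth, a sketch of one way to set JOBDIR per spider without touching self.settings in the class body is the custom_settings class attribute; whether it exists in the asker's Scrapy release is an assumption (it was added in later versions), and the spider name is a placeholder:

from scrapy import Spider

class Test(Spider):
    name = "test"  # placeholder spider name
    # applied to the crawler's settings before the spider is bound to it,
    # so nothing needs to be overridden from inside the class body
    custom_settings = {
        'JOBDIR': 'seen',
    }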
Tags: python ×5 · scrapy ×5 · python-2.7 ×2 · web-crawler ×2 · django ×1 · export ×1 · twisted ×1 · web-scraping ×1