I am new to Scrapy and I am looking for a way to run it from a Python script. I found two sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the …

I have to call the crawler from another Python file, and I use the following code.
def crawl_koovs():
spider = SomeSpider()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
When I run it, I get the error
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
Run Code Online (Sandbox Code Playgroud)
我不想使用,因为我想多次调用此方法,并希望在下一次调用之前停止reactor.我可以做些什么来完成这项工作(可能会强制爬虫在相同的'主'线程中启动)?
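One workaround that is often suggested for this (and that the multiprocessing snippet in the first question hints at) is to run every crawl in a fresh child process, so that each crawl gets its own reactor living in that process's main thread. A minimal sketch of the idea, assuming a recent Scrapy (1.0+) where CrawlerProcess is available and a regular project on the path; SomeSpider stands for the spider class from the question:

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def _run_crawl(spider_cls, **spider_kwargs):
    # Runs in the child process: one fresh reactor per process,
    # always in that process's main thread.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls, **spider_kwargs)
    process.start()  # blocks until the crawl finishes


def crawl_in_subprocess(spider_cls, **spider_kwargs):
    p = multiprocessing.Process(target=_run_crawl,
                                args=(spider_cls,), kwargs=spider_kwargs)
    p.start()
    p.join()


# crawl_in_subprocess(SomeSpider)  # can be called as many times as needed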
I want to run my spider from a script rather than via scrapy crawl. I found this page:
http://doc.scrapy.org/en/latest/topics/practices.html
but it doesn't actually say where I should put that script. Any help?
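A common convention, which that page does not spell out, is to put the script at the project root, next to scrapy.cfg, so that get_project_settings() can locate the project settings. A minimal sketch, assuming Scrapy 1.0+ and placeholder names myproject / MySpider:

# run_spider.py, placed next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # placeholder import path

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # blocks here until the crawl is finished

The script is then started with python run_spider.py instead of scrapy crawl.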
The official tutorial specifies how Scrapy can be called from within a Python script. By changing the following settings attributes:
settings.overrides['FEED_URI'] = output_path
settings.overrides['FEED_FORMAT'] = 'json'
I was able to store the scraped data in a JSON file. However, what I am trying to do is process and return the scraped data directly inside a function I define, so that other functions can call this wrapper function to scrape certain websites. I imagine there are some settings around FEED_URI that I could play with, but I am not sure. Any advice would be greatly appreciated!
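One way to return the data directly, rather than going through FEED_URI and reading the file back, is to collect items with the item_scraped signal inside a wrapper function. A sketch of the idea, assuming Scrapy 1.0+ (CrawlerProcess and create_crawler); note that, because of the Twisted reactor, such a wrapper can only be called once per process (see the multiprocessing workaround sketched earlier for repeated calls):

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def scrape_to_list(spider_cls, **spider_kwargs):
    # Runs one crawl and returns the scraped items as a list of dicts.
    items = []

    def collect(item, response, spider):
        items.append(dict(item))

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collect, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is done
    return items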
I am running Scrapy from a script, but all it does is activate the spider; it does not go through my item pipelines. I have read http://scrapy.readthedocs.org/en/latest/topics/practices.html, but it doesn't say anything about including pipelines.

My setup:
Scraper/
scrapy.cfg
ScrapyScript.py
Scraper/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
my_spider.py
My script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from Scraper.spiders.my_spider import MySpiderSpider
spider = MySpiderSpider(domain='myDomain.com')
settings = get_project_settings
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Reactor activated...')
reactor.run()
log.msg('Reactor stopped.')
My pipeline:
from scrapy.exceptions import DropItem
from scrapy import log
import sqlite3
class ImageCheckPipeline(object):
def process_item(self, item, spider):
if item['image']:
log.msg("Item added successfully.")
return …
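One thing worth checking in the script above: the crawler is built with Crawler(Settings()), i.e. with default settings, so the ITEM_PIPELINES declared in Scraper/settings.py are most likely never loaded, which would explain why the pipeline is skipped. A sketch of the same run using the project settings instead, written against the newer CrawlerProcess API (Scrapy 1.0+); the older Crawler/reactor recipe behaves the same way once get_project_settings() is actually called and passed in:

import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Scraper.spiders.my_spider import MySpiderSpider

# Make sure the project settings module (and with it ITEM_PIPELINES) is found.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'Scraper.settings')

process = CrawlerProcess(get_project_settings())
process.crawl(MySpiderSpider, domain='myDomain.com')
process.start()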
Following the documentation below, I can run Scrapy from a Python script, but I cannot get the scraped results back.

Here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem
class DmozSpider(BaseSpider):
name = "douban"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com/group/xxx/discussion"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
items = []
# print sites
for row in rows:
item = DmozItem()
item["title"] = row.select('text()').extract()[0]
item["link"] = row.select('@href').extract()[0]
items.append(item)
return items
Note the last line: I try to use the returned parse results. If I run
scrapy crawl douban
the terminal prints the returned results, but I cannot get those results from the Python script. Here is my Python script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy …
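For what it's worth, the items returned from parse() go to the Scrapy engine (and from there to pipelines and feed exporters), not back to the calling script, which is why the terminal shows them but the script never sees them. One way to get them into the script, as an alternative to the signal-based collector sketched earlier, is a small pipeline declared in the script itself. This is only a sketch, assuming Scrapy 1.0+ and that DmozSpider can be imported from wherever the spider above is defined:

from scrapy.crawler import CrawlerProcess

from dmoz_spider import DmozSpider  # placeholder: wherever the spider above lives

collected_items = []


class ItemCollectorPipeline(object):
    # Appends every scraped item to the module-level list above.
    def process_item(self, item, spider):
        collected_items.append(item)
        return item


process = CrawlerProcess({
    'ITEM_PIPELINES': {'__main__.ItemCollectorPipeline': 100},
})
process.crawl(DmozSpider)
process.start()  # blocks until the crawl finishes

for item in collected_items:
    print(item['title'], item['link'])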
My scraper works fine when I run it from the command line, but when I try to run it from within a Python script (using the Twisted approach outlined here), it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files; one of them uses CsvItemExporter() and the other uses writeCsvFile(). Here is the code:

class CsvExportPipeline(object):
def __init__(self):
self.files = {}
@classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
nodes = open('%s_nodes.csv' % spider.name, 'w+b')
self.files[spider] = nodes
self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url','name','screenshot'])
self.exporter1.start_exporting()
self.edges = []
self.edges.append(['Source','Target','Type','ID','Label','Weight'])
self.num = 1
def spider_closed(self, spider):
self.exporter1.finish_exporting()
file = self.files.pop(spider)
file.close()
writeCsvFile(getcwd()+r'\edges.csv', self.edges)
def process_item(self, item, spider):
self.exporter1.export_item(item)
for url in item['links']:
self.edges.append([item['url'],url,'Directed',self.num,'',1])
self.num += 1
return item
Here is my file structure:
SiteCrawler/ # the CSVs are normally created …
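Two common culprits for this symptom, offered as guesses rather than a diagnosis: the script builds its own settings, so the project's settings.py (and with it ITEM_PIPELINES) may never be applied and CsvExportPipeline is simply not registered; and the pipeline writes '%s_nodes.csv' and getcwd()+r'\edges.csv' relative to the current working directory, so the files may end up wherever the script happens to be launched from. A sketch of pinning both down explicitly (Scrapy 1.0+; the dotted module paths are assumptions, since the tree above is cut off):

import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from sitecrawler.spiders.my_spider import MySpider  # placeholder import path

settings = get_project_settings()
# Force the pipeline on in case the project settings were not picked up.
settings.set('ITEM_PIPELINES', {'sitecrawler.pipelines.CsvExportPipeline': 300})

print('CSV files will be written under:', os.getcwd())

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()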
I can run a crawl from a Python script using the following recipe from the wiki:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I can pass domain to FollowAllSpider, but my question is how I can use the code above to pass start_urls (actually, a date that will be appended to a fixed URL) to my spider class.

Here is my spider class:
class MySpider(CrawlSpider):
name = 'tw'
def __init__(self,date):
y,m,d=date.split('-') #this is a test , it could split with regex!
try:
y,m,d=int(y),int(m),int(d)
except ValueError:
raise 'Enter a valid date'
self.allowed_domains = ['mydomin.com']
self.start_urls …
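The short answer, as a sketch: spider arguments are just constructor keyword arguments, so date can be passed exactly the way domain is passed to FollowAllSpider in the recipe above, and start_urls can be built inside __init__. The import path is the Scrapy 1.0+ one and the URL pattern is a placeholder, since the original spider is cut off above; with CrawlerProcess the same argument could be passed as process.crawl(MySpider, date='2014-05-01'):

from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        y, m, d = date.split('-')  # the same validation as in the question can go here
        self.allowed_domains = ['mydomin.com']
        # Placeholder URL pattern; the real one is truncated in the question.
        self.start_urls = ['http://mydomin.com/%s/%s/%s' % (y, m, d)]


# In the script, exactly like domain in the recipe above:
spider = MySpider(date='2014-05-01')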
I have found that many Scrapy tutorials (such as this one) require the steps listed below. The result is a project consisting of many files (project.cfg + some .py files + a specific folder structure).

How can I make these steps (listed below) work as a standalone Python file that can be run with python mycrawler.py? (A sketch of such a single file follows the four steps.)
(Rather than a whole project containing many files, a .cfg file, etc., which has to be run with scrapy crawl myproject -o myproject.json ... it seems scrapy is a new shell command? Is that true?)
Note: here may be an answer to this question, but unfortunately it is deprecated and no longer works.
1) Create a new Scrapy project with scrapy startproject myproject
2) Define the data structure as an Item, like this:
from scrapy.item import Item, Field
class MyItem(Item):
title = Field()
link = Field()
...
3) Define the crawler with:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class MySpider(BaseSpider):
name = "myproject"
allowed_domains = ["example.com"]
start_urls = ["http://www.example.com"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
...
4) Run:
scrapy crawl myproject …
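A sketch of how the four steps can collapse into one self-contained file that runs with python mycrawler.py, assuming Scrapy 1.x (CrawlerProcess plus the FEED_URI / FEED_FORMAT settings); example.com and the XPath are placeholders:

# mycrawler.py, run with:  python mycrawler.py
import scrapy
from scrapy.crawler import CrawlerProcess


class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()


class MySpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        item = MyItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['link'] = response.url
        yield item


if __name__ == '__main__':
    # Equivalent of:  scrapy crawl myproject -o myproject.json
    process = CrawlerProcess({
        'FEED_FORMAT': 'json',
        'FEED_URI': 'myproject.json',
    })
    process.crawl(MySpider)
    process.start()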
I am trying to build an application using Flask and Scrapy. I have to pass a list of URLs to the spider. I tried the following syntax:

In the spider's __init__:
self.start_urls = ["http://www.google.com/patents/" + x for x in u]
Flask Method
u = ["US6249832", "US20120095946"]
os.system("rm static/s.json; scrapy crawl patents -d u=%s -o static/s.json" % u)
I know something similar can be done by reading the desired URLs from a file, but can I pass the list of URLs to crawl directly like this?
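One sketch of the spider-argument route: pass the list as a comma-separated string and split it in __init__. Note that scrapy crawl takes spider arguments with -a, not -d; the argument name patents and the class name PatentsSpider below are assumptions:

import os

import scrapy


class PatentsSpider(scrapy.Spider):
    name = 'patents'

    def __init__(self, patents='', *args, **kwargs):
        super(PatentsSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.google.com/patents/' + p
                           for p in patents.split(',') if p]


# In the Flask view, join the list and pass it as a spider argument:
u = ['US6249832', 'US20120095946']
os.system('rm static/s.json; scrapy crawl patents -a patents=%s -o static/s.json'
          % ','.join(u))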
I wrote two spiders in a single file. When I run scrapy runspider two_spiders.py, only the first spider is executed. How can I run both of them without splitting the file in two?
two_spiders.py:
import scrapy
class MySpider1(scrapy.Spider):
# first spider definition
...
class MySpider2(scrapy.Spider):
# second spider definition
...
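For reference, the common-practices page covers this under "Running multiple spiders in the same process": drive both spiders from a CrawlerProcess and run the file with python two_spiders.py instead of scrapy runspider. A sketch, assuming Scrapy 1.0+:

# at the bottom of two_spiders.py
if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # starts both crawls and blocks until they finish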