I am trying to pass user-defined arguments to a Scrapy spider. Can anyone suggest how to do this?
I read somewhere about a -a parameter, but have no idea how to use it.
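For reference, the -a flag passes key=value pairs straight into the spider's constructor as keyword arguments. A minimal sketch, assuming a hypothetical spider named myspider and a made-up category argument:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Every -a key=value given on the command line arrives
        # here as a keyword argument.
        self.category = category

    def start_requests(self):
        # The site URL below is a placeholder, only illustrating use of the argument.
        yield scrapy.Request('http://example.com/%s' % self.category)

Invoked as: scrapy crawl myspider -a category=electronics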
I am new to Scrapy and I am looking for a way to run it from a Python script. I found two sources explaining this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the …
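For what it's worth, newer Scrapy releases expose a documented entry point for running a crawl from a script; a minimal sketch using CrawlerProcess, assuming a project spider registered under the name 'example' (a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Reads the project settings, so the spider can be referenced by name.
process = CrawlerProcess(get_project_settings())
process.crawl('example')  # spider name or spider class
process.start()  # blocks until the crawl is finished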
I want to use Scrapy to crawl a web page. Is there a way to pass the starting URL from the terminal itself?
The documentation says a spider can be given either a name or a URL, but when I give a URL it throws an error:
// My spider's name is example, but instead of the spider name I am giving a URL (it works fine if I give the spider name).
scrapy crawl example.com
Error:
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: example.com'
How can I make Scrapy use my spider on a URL given at the terminal?
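One common pattern (a sketch, not the only way) is to keep the spider name fixed and hand the URL over with -a, assuming a spider named example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self, start_url=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # -a start_url=... arrives here as a keyword argument.
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        self.log('Visited %s' % response.url)

Invoked as: scrapy crawl example -a start_url=http://example.com/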
There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them, because my browsing history got deleted unexpectedly.
All of the above questions couldn't help me. Either they used Celery or Scrapyd, while I want to use the multiprocessing library. Also, the official Scrapy documentation shows how to run multiple spiders in a SINGLE process, not in MULTIPLE processes.
None of them could help …
python scrapy web-scraping scrapy-spider python-multiprocessing
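A minimal sketch of the multiprocessing route, assuming two project spiders registered as spider_one and spider_two (placeholder names); each child process gets its own Twisted reactor, which sidesteps the "reactor not restartable" problem:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider(spider_name):
    # A fresh CrawlerProcess per OS process; each owns its own reactor.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()  # blocks this child until its crawl finishes

if __name__ == '__main__':
    procs = [Process(target=run_spider, args=(name,))
             for name in ('spider_one', 'spider_two')]
    for p in procs:
        p.start()
    for p in procs:
        p.join()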
I can run a crawl in a Python script using the following recipe from the wiki:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I can pass a domain to FollowAllSpider, but my question is: how can I use the code above to pass start_urls (actually, a date that will be appended to a fixed URL) to my spider class?
Here is my spider class:
class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test; it could split with a regex!
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')
        self.allowed_domains = ['mydomin.com']
        self.start_urls …
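Since the recipe above constructs the spider instance itself before handing it to crawler.crawl(spider), the date can simply be passed to the constructor, which then builds start_urls. A sketch under that assumption (the URL pattern below is a placeholder):

from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['mydomin.com']

    def __init__(self, date, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        y, m, d = map(int, date.split('-'))
        # Append the date to the fixed URL (placeholder pattern).
        self.start_urls = ['http://www.mydomin.com/%04d-%02d-%02d' % (y, m, d)]

spider = MySpider(date='2014-01-01')  # then hand it to crawler.crawl(spider)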