I'm building a spider with Scrapy, and I want to use a MySQL database to supply the start_urls for my spider. Now I'm wondering: is it possible to connect Scrapy Cloud to a remote database?
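Connecting from Scrapy Cloud to a remote database is possible as long as the database accepts remote connections and the driver package is declared as a project dependency. A minimal sketch of pulling start URLs from MySQL, assuming the pymysql driver and a hypothetical urls table (host, credentials, and schema below are all placeholders):

import pymysql  # pure-Python MySQL driver; list it in requirements.txt for Scrapy Cloud
import scrapy

class UrlsFromMySQLSpider(scrapy.Spider):
    name = 'urls_from_mysql'

    def start_requests(self):
        # Placeholder connection details -- the host must allow remote connections
        conn = pymysql.connect(host='db.example.com', user='scrapy',
                               password='secret', database='crawler')
        try:
            with conn.cursor() as cursor:
                cursor.execute('SELECT url FROM urls')  # hypothetical table/column
                for (url,) in cursor.fetchall():
                    yield scrapy.Request(url, callback=self.parse)
        finally:
            conn.close()

    def parse(self, response):
        pass  # extraction logic goes here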
As of recently, periodic jobs are no longer included in Scrapinghub's free plan, which is what I was using to run my Scrapy spiders.
So I decided to switch to Scrapyd, and I went ahead and set up a virtual server running Ubuntu 16.04. (This is my first time setting up and running a server, so bear with me.)
Following the instructions on scrapyd.readthedocs.io, I installed Scrapyd with pip:
$ pip install scrapyd

(That was after I found out that the recommended way for Ubuntu, using apt-get, is actually no longer supported; see GitHub.)
Then I logged in to my server over SSH and started Scrapyd by simply running:
$ scrapyd

As far as I can tell, everything looks fine:
2017-10-30 17:31:19+0000 [-] Log opened.
2017-10-30 17:31:19+0000 [-] twistd 16.0.0 (/usr/bin/python 2.7.12) starting up.
2017-10-30 17:31:19+0000 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-10-30 17:31:19+0000 [-] Site starting on 6800
2017-10-30 17:31:19+0000 [-] Starting factory <twisted.web.server.Site instance at 0x7f644752bfc8>
2017-10-30 17:31:19+0000 [Launcher] Scrapyd 1.2.0 started: max_proc=4, runner=u'scrapyd.runner'
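At this point the daemon only listens; to replace Scrapinghub's periodic jobs, runs have to be scheduled against Scrapyd's HTTP JSON API on port 6800, for example from cron. A sketch, where myproject and myspider are placeholder names for a project deployed with scrapyd-deploy:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

On success Scrapyd responds with a JSON body like {"status": "ok", "jobid": "..."}; putting that curl line into a crontab entry gives the periodic behaviour the free Scrapinghub plan dropped.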
… I can run my Scrapy project locally without any problems; however, when I try to run the job from Scrapinghub (connecting to MongoDB Atlas in the cloud), I get the following error:

exceptions.ImportError: No module named pymodm
I import it with:
import pymodm
Any help is much appreciated.

Cheers
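For what it's worth, the Scrapy Cloud stacks only ship a fixed set of packages, so any extra dependency such as pymodm has to be declared in a requirements file referenced from scrapinghub.yml. A minimal sketch (the project ID 12345 and the version pin are placeholders):

# scrapinghub.yml
projects:
  default: 12345
requirements_file: requirements.txt

# requirements.txt -- pin to the version you run locally
pymodm==0.4.0

After adding both files, redeploying with shub deploy should make the module importable in the cloud job.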
I'm working on a Scrapy project in Python 3 and deploying the spiders to Scrapinghub. I'm also using Google Cloud Storage to store the scraped files, as described in the official documentation.
The spider runs perfectly fine when I run it locally, and it also deploys to Scrapinghub without any errors. I'm using scrapy:1.4-py3 as the stack on Scrapinghub. When running the spider, I get the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File …
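The traceback dies while the item pipelines are being instantiated, which is consistent with the GCS client library not being importable on the stack image. Assuming Scrapy's native gs:// files storage is the goal (it shipped in Scrapy 1.5, so the scrapy:1.4-py3 stack may also need updating), a sketch of what the deployed project would declare, with placeholder bucket and project names:

# requirements.txt -- the stack image does not bundle the GCS client
google-cloud-storage

# settings.py
FILES_STORE = 'gs://my-bucket/files/'  # placeholder bucket path
GCS_PROJECT_ID = 'my-gcp-project'      # placeholder Google Cloud project ID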
When I try to deploy it to the cloud, I run into the following error:
Error: Deploy failed (400):
project: non_field_errors
My current setup looks like this:
def __init__(self, startUrls, *args, **kwargs):
    # 'startUrls' arrives as a JSON-encoded list (passed as a spider argument),
    # so decode it once here; requires "import json" at module level.
    self.keywords = ['sales', 'advertise', 'contact', 'about', 'policy',
                     'terms', 'feedback', 'support', 'faq']
    self.startUrls = json.loads(startUrls)
    super(MySpider, self).__init__(*args, **kwargs)

def start_requests(self):
    # Emit one request per decoded start URL; requires "from scrapy import Request".
    for url in self.startUrls:
        yield Request(url=url)
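One likely culprit for the 400 above (an assumption, since the response is opaque): shub cannot resolve a valid numeric project ID, which has nothing to do with the spider code. A minimal scrapinghub.yml next to scrapy.cfg, with 12345 as a placeholder Scrapy Cloud project ID:

# scrapinghub.yml
projects:
  default: 12345

With that in place, shub deploy (or shub deploy 12345) should get past the project: non_field_errors validation.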