The stockInfo.py contains:
import scrapy
import re
import pkgutil


class QuotesSpider(scrapy.Spider):
    name = "stockInfo"
    # Read the URL list bundled inside the "tutorial" package.
    data = pkgutil.get_data("tutorial", "resources/urls.txt")
    data = data.decode()
    start_urls = data.split("\r\n")

    def parse(self, response):
        # Extract the six-digit stock code from the URL.
        company = re.findall("[0-9]{6}", response.url)[0]
        filename = '%s_info.html' % company
        # Write the raw page body to a local HTML file.
        with open(filename, 'wb') as f:
            f.write(response.body)
The stockInfo spider is run from the Windows cmd window:
d:
cd tutorial
scrapy crawl stockInfo
Now every web page listed in resources/urls.txt is downloaded into the directory d:/tutorial.
Then I deploy the spider to Scrapinghub and run the stockInfo spider.
No errors occur, but where are the downloaded web pages?
How are the following lines executed on Scrapinghub?
with open(filename, 'wb') as f:
    f.write(response.body)
How can the data be saved on Scrapinghub, and then downloaded from Scrapinghub after the job finishes?
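One common approach (my own suggestion, not code from the original post) is to avoid writing to the local filesystem, which is not accessible once a Scrapinghub job has finished, and instead yield each page as an item so that Scrapinghub keeps it in the job's item storage. A minimal sketch of parse rewritten this way, reusing the same six-digit company code:

    def parse(self, response):
        company = re.findall("[0-9]{6}", response.url)[0]
        # Yield the raw HTML as an item field instead of writing a file,
        # so Scrapinghub stores it with the job's items.
        yield {
            'company': company,
            'html': response.text,
        }

The stored items can then be browsed in the job's Items tab or fetched through the API.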
First, install scrapinghub:
pip install scrapinghub[msgpack]
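Once scrapinghub is installed, the items of a finished job can be read back with the python-scrapinghub client. A rough sketch, assuming a placeholder API key, project id and job key that you would replace with your own:

from scrapinghub import ScrapinghubClient

# Placeholder credentials and ids -- replace with your own values.
client = ScrapinghubClient('YOUR_API_KEY')
project = client.get_project(123456)      # your project id
job = project.jobs.get('123456/1/1')      # your job key

# Iterate over the items stored by the finished job and write
# each page back to a local HTML file.
for item in job.items.iter():
    filename = '%s_info.html' % item['company']
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(item['html'])

This assumes the spider yields items with 'company' and 'html' fields, as in the sketch above.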
Rewriting Thiago Curvelo's answer a little, and deploying it in my Scrapinghub:
Deploy …

I got a Scrapy example from a website, but something seems wrong: it cannot fetch all of the content, and I do not know what is going on. The example uses Scrapy + Redis + MongoDB.
Info:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), …