I want to parse a list of stock tickers, so I'm trying to format the tail end of my start_urls entries so that I only have to add the symbol, not the whole URL.

Here is the spider class, with the stock_list method that builds start_urls:
class MySpider(BaseSpider):
    symbols = ["SCMP"]
    name = "dozen"
    allowed_domains = ["yahoo.com"]

    def stock_list(stock):
        start_urls = []
        for symb in symbols:
            start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
        return start_urls

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        revenue = hxs.select('//td[@align="right"]')
        items = []
        for rev in revenue:
            item = DozenItem()
            item["Revenue"] = rev.xpath("./strong/text()").extract()
            items.append(item)
        return items[0:3]
If I get rid of stock_list and just run it with a plain start_urls as usual, it works fine; but as it stands it exports nothing more than an empty file.
Also, should I set something up with sys.argv so that I can just type the ticker symbol as a command-line argument when I run $ scrapy crawl dozen -o items.csv?
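Side note: rather than sys.argv, Scrapy passes per-run spider arguments to the spider's __init__ via the -a flag. A minimal sketch of that approach, where the symbols argument name and the comma-separated format are placeholders of my own:

    import scrapy

    class MySpider(scrapy.Spider):
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def __init__(self, symbols="SCMP", *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # Comma-separated tickers from the command line, e.g.:
            #   scrapy crawl dozen -a symbols=SCMP,GOOG -o items.csv
            self.symbols = symbols.split(",")

        def start_requests(self):
            # One request per ticker symbol.
            for s in self.symbols:
                yield scrapy.Request("http://finance.yahoo.com/q/is?s={}&annual".format(s))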
Normally the shell prints 2015-04-25 14:50:57-0400 [dozen] DEBUG: Crawled (200) <GET http://finance.yahoo.com/q/is?s=SCMP+Income+Statement&annual> in the LOG/DEBUG output, but right now it doesn't, which means start_urls isn't being formatted properly.
The correct way to implement dynamic start URLs is to use start_requests(). Using start_urls is the preferred practice when you have a static list of start URLs.

start_requests() must return an iterable with the first requests to crawl.

Example:
import scrapy

class MySpider(scrapy.Spider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    stock = ["SCMP", "APPL", "GOOG"]

    def start_requests(self):
        BASE_URL = "http://finance.yahoo.com/q/is?s={}"
        # Yield one request per ticker symbol.
        for s in self.stock:
            yield scrapy.Request(url=BASE_URL.format(s))

    def parse(self, response):
        # parse the responses here
        pass
This way you can also use a generator instead of a pre-built list, which scales much better when the stock list is large.
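For instance, here is a minimal sketch of the generator idea, assuming a hypothetical symbols.txt file with one ticker per line:

    import scrapy

    class MySpider(scrapy.Spider):
        name = "dozen"
        allowed_domains = ["yahoo.com"]

        def start_requests(self):
            BASE_URL = "http://finance.yahoo.com/q/is?s={}"
            # Stream tickers lazily rather than building a list up front;
            # symbols.txt is a hypothetical one-ticker-per-line file.
            with open("symbols.txt") as f:
                for line in f:
                    symbol = line.strip()
                    if symbol:
                        yield scrapy.Request(url=BASE_URL.format(symbol))

Because start_requests() is itself a generator here, Scrapy consumes the requests one at a time, so the full URL list never has to be held in memory.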