救命!阅读源代码Scrapy对我来说并不容易.我有一个很长的start_urls清单.它在文件中约为3,000,000.所以,我start_urls这样做:
start_urls = read_urls_from_file(u"XXXX")
def read_urls_from_file(file_path):
with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
for line in f:
try:
url = line.strip()
yield url
except:
print u"read line:%s from file failed!" % line
continue
print u"file read finish!"
Run Code Online (Sandbox Code Playgroud)
MeanWhile,我的蜘蛛的回调函数是这样的:
def parse(self, response):
self.log("Visited %s" % response.url)
return Request(url=("http://www.baidu.com"), callback=self.just_test1)
def just_test1(self, response):
self.log("Visited %s" % response.url)
return Request(url=("http://www.163.com"), callback=self.just_test2)
def just_test2(self, response):
self.log("Visited %s" % response.url)
return []
Run Code Online (Sandbox Code Playgroud)
我的问题是:
just_test1,just_test2 通过下载器只有所有后方可使用
start_urls,使用?(我做了一些测试,似乎答案是没有) …