How to dynamically generate start_urls while crawling?

use*_*269 24 web-crawler scrapy web-scraping

I am crawling a site which may contain many start URLs, such as http://www.a.com/list_1_2_3.htm.

I want to populate start_urls with URLs matching [list_\d+_\d+_\d+\.htm], and extract items from URLs matching [node_\d+\.htm] during crawling.

Can I use CrawlSpider to achieve this? And how do I generate the start_urls dynamically while crawling?

Thanks a lot!

小智 43

The best way to generate the URLs dynamically is to override the spider's start_requests method:

from scrapy.http.request import Request

def start_requests(self):
    # Read one URL per line; open in text mode and strip the
    # trailing newline, or the requests will fail
    with open('urls.txt') as urls:
        for url in urls:
            yield Request(url.strip(), self.parse)
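If the start URLs follow the list_\d+_\d+_\d+\.htm pattern from the question rather than living in a file, the same start_requests approach works with a generated list. A minimal sketch in plain Python (the helper name, base URL, and page count are illustrative assumptions, not part of Scrapy):

```python
# Hypothetical helper: build list_<n>_<n+1>_<n+2>.htm start URLs
# programmatically instead of reading them from urls.txt.
def make_start_urls(base="http://www.a.com", pages=3):
    return ["%s/list_%d_%d_%d.htm" % (base, n, n + 1, n + 2)
            for n in range(pages)]
```

Inside the spider you would then yield Request(url, self.parse) for each URL returned by the helper, exactly as in the file-based version above.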


akh*_*hab 14

Taking your two questions in turn:

1) Yes, you can achieve this with a Rule, for example:

rules = (Rule(SgmlLinkExtractor(allow=('node_\d+\.htm',)), callback='parse_item'),)

(Note: with CrawlSpider the callback must not be named parse, because CrawlSpider uses parse internally to drive the rules; use a name like parse_item instead.)
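The allow pattern is a regular expression, which is why the dot in node_\d+.htm should be escaped. A quick check with Python's re module (the sample URLs are made up for illustration):

```python
import re

# Escaped dot: matches node_<digits>.htm links and nothing else
pattern = re.compile(r"node_\d+\.htm")

print(bool(pattern.search("http://www.a.com/node_42.htm")))    # True
print(bool(pattern.search("http://www.a.com/list_1_2_3.htm"))) # False
```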

Suggested reading: the Scrapy CrawlSpider documentation.

2) Yes, you can generate start_urls dynamically; start_urls is just a list.

For example:

>>> start_urls = ['http://www.a.com/%d_%d_%d' % (n, n+1, n+2) for n in range(0, 26)]

>>> start_urls

['http://www.a.com/0_1_2', 'http://www.a.com/1_2_3', 'http://www.a.com/2_3_4', 'http://www.a.com/3_4_5', 'http://www.a.com/4_5_6', 'http://www.a.com/5_6_7',  'http://www.a.com/6_7_8', 'http://www.a.com/7_8_9', 'http://www.a.com/8_9_10','http://www.a.com/9_10_11', 'http://www.a.com/10_11_12', 'http://www.a.com/11_12_13', 'http://www.a.com/12_13_14', 'http://www.a.com/13_14_15', 'http://www.a.com/14_15_16', 'http://www.a.com/15_16_17', 'http://www.a.com/16_17_18', 'http://www.a.com/17_18_19', 'http://www.a.com/18_19_20', 'http://www.a.com/19_20_21', 'http://www.a.com/20_21_22', 'http://www.a.com/21_22_23', 'http://www.a.com/22_23_24', 'http://www.a.com/23_24_25', 'http://www.a.com/24_25_26', 'http://www.a.com/25_26_27']