Scrapy: crawl URLs in order

Jef*_*eff 21 python sorting asynchronous hashmap scrapy

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results come back in a random order, e.g. it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, in case that was the problem, but that didn't change anything.

war*_*iuc 19

start_urls defines the urls which are used in the start_requests method. Your parse method is called with a response for each start url once the page has been downloaded. But you cannot control the load times - the first start url might be the last to arrive at parse.

Solution - override the start_requests method, and add a meta with a priority key to the generated requests. In parse, extract this priority value and add it to the item. In a pipeline, do something based on this value. (I don't know why and where you need these urls to be processed in this order.)
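A minimal sketch of that first approach, reusing the spider from the question; the meta key name 'priority' is arbitrary, and it assumes MlboddsItem declares a matching field:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def start_requests(self):
       # tag each request with its position in start_urls
       for index, url in enumerate(self.start_urls):
           yield Request(url, callback=self.parse, meta={'priority': index})

   def parse(self, response):
       item = MlboddsItem()
       # carry the position over to the item, so a pipeline can
       # sort the items back into the original order
       # (assumes MlboddsItem declares a 'priority' field)
       item['priority'] = response.meta['priority']
       # ... fill in the other fields as in the question ...
       yield item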

Or make it kind of synchronous - store these start urls somewhere. Put only the first of them into start_urls. In parse, process the first response and yield the item(s), then take the next url from your storage and request it with a callback to parse.
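And a sketch of the second, sequential variant - the other_urls attribute is a name made up here for illustration:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   # only the first url goes into start_urls; the rest wait in line
   start_urls = ["http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/"]
   other_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       # ... extract and yield the items for this page first ...
       # then request the next url, so pages are processed one by one
       if self.other_urls:
           yield Request(self.other_urls.pop(0), callback=self.parse)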


San*_*pal 16

Scrapy's Request now has a priority attribute: http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects

If you have many Requests in a callback and want a particular request to be processed first, you can set its priority on the Request - Scrapy will process the one with priority 1 first.
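For example (a sketch - the urls and the parse_data callback are placeholders):

from scrapy.http import Request

def parse(self, response):
    # higher priority values are dequeued first by the scheduler
    yield Request("http://www.example.com/first",
                  callback=self.parse_data, priority=1)
    yield Request("http://www.example.com/second",
                  callback=self.parse_data, priority=0)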


小智 8

A Google Groups discussion suggests using the priority attribute in the Request object. Scrapy guarantees that urls are crawled in DFO by default. But it does not ensure that urls are visited in the order they were yielded within your parse callback.

Instead of yielding Request objects, you want to return an array of requests from which objects will be popped until it's empty.

Can you try something like this?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]

   def start_requests(self):
       # the default scheduler pops requests last-in, first-out,
       # so the list is reversed to crawl the urls in the written order
       start_urls = reversed( [
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
       ] )

       return [ Request(url = start_url) for start_url in start_urls ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items


小智 6

There is a much easier way to make scrapy follow the order of start_urls: just uncomment and change the concurrent requests setting in settings.py to 1.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

  • Or you can add `custom_settings = { 'CONCURRENT_REQUESTS': '1' }` right below `class DmozSpider(BaseSpider): name = "dmoz"` - that way you don't need to touch settings.py at all (see the sketch below). (5 upvotes)
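A sketch of that per-spider variant; note that custom_settings requires Scrapy 1.0+, where spiders subclass scrapy.Spider, and the spider name here is just the comment's example:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # limit this spider to a single request in flight at a time,
    # without changing the project-wide settings.py
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def parse(self, response):
        pass  # parsing logic goes here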