Jef*_*eff 21 python sorting asynchronous hashmap scrapy
So, my question is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results come back in a random order, e.g. it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.
war*_*iuc 19
start_urls defines the urls that are used in the start_requests method. Your parse method is called with a response for each start url once that page has been downloaded. But you cannot control the loading times - the first start url might be the last to reach parse.
Solution - override the start_requests method, and add a meta with a priority key to the generated requests. In parse, extract this priority value and add it to the item. In a pipeline, do something based on this value. (I don't know why and where you need these urls to be processed in this order.) A sketch of this idea follows.
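A minimal sketch of that first approach, in the same old-style Scrapy API as the question. The 'priority' item field and the SortByPriorityPipeline are assumptions added for illustration, not part of the original code:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    def start_requests(self):
        urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
        ]
        # Tag every request with its position in the list.
        return [Request(url, meta={'priority': i}) for i, url in enumerate(urls)]

    def parse(self, response):
        item = MlboddsItem()
        # Assumes MlboddsItem declares a 'priority' field.
        item['priority'] = response.meta['priority']
        # ... fill 'header' and 'game1' exactly as in the question's parse() ...
        return item

class SortByPriorityPipeline(object):
    """Hypothetical pipeline: buffer all items, restore order on close."""
    def __init__(self):
        self.buffered = []

    def process_item(self, item, spider):
        self.buffered.append(item)
        return item

    def close_spider(self, spider):
        self.buffered.sort(key=lambda item: item['priority'])
        # ... write self.buffered out in list order ...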
Or make it kind of synchronous - store these start urls somewhere. Put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next url from your storage and make a request for it with a callback to parse.
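And a rough sketch of this synchronous variant; pending_urls is a made-up name standing in for "your storage":

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    # Only the first url goes into start_urls; the rest wait their turn.
    start_urls = ["http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/"]
    pending_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def parse(self, response):
        # ... build and yield the items from this response as in the question ...
        # Only after this page is handled do we request the next one,
        # so the pages are fetched strictly in list order.
        if self.pending_urls:
            yield Request(self.pending_urls.pop(0), callback=self.parse)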
San*_*pal 16
Scrapy's Request now has a priority attribute: http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects. If you have many Requests in a function and want a particular request to be processed first, you can set a priority on it; Scrapy will process the one with priority=1 first.
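For instance (parse_day is a hypothetical callback; the default priority is 0, and higher values are dequeued first):

from scrapy.http import Request

def parse(self, response):
    # The second request carries a higher priority, so Scrapy will
    # download it before the first one even though it is yielded later.
    yield Request("http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
                  callback=self.parse_day)
    yield Request("http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
                  callback=self.parse_day, priority=1)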
小智 8
A Google Groups discussion suggests using the priority attribute on the Request object. Scrapy guarantees the urls are crawled in DFO by default, but it does not ensure that the urls are visited in the order they were yielded within your parse callback.
Instead of yielding Request objects, you want to return an array of Requests from which objects will be popped until it is empty.
Can you try something like this?
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    def start_requests(self):
        start_urls = reversed([
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
        ])
        return [Request(url=start_url) for start_url in start_urls]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
小智 6
There is a much easier way to make Scrapy follow the order of start_urls: you can just uncomment and change the concurrent-requests setting in settings.py to 1.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1