Uma*_*air 6 python screen-scraping scrapy scrapy-spider
I have a Scrapy spider that crawls a website, and the site requires a refreshed token before its pages can be accessed.
    def get_ad(self, response):
        temp_dict = AppextItem()
        try:
            Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
            print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
            self.p_token = ''
            return Request(url=url_, callback=self.get_p_token, method="GET", priority=1, meta=response.meta)
        except Exception:
            print("Captcha was not found")
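I have a method, get_p_token, that refreshes the token and assigns it to self.p_token.
get_p_token is called once a captcha is found, but the problem is that the other requests keep executing.
What I want is that, if a captcha is found, no further request is executed until get_p_token has finished.
I set priority=1, but that does not help.
PS:
The token is actually passed with every URL, which is why I want to wait until a new token has been obtained before scraping the remaining URLs.
I would proceed like this: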
    def get_p_token(self, response):
        # generate token
        ...
        # Continue with no_captcha once the token has been refreshed;
        # dont_filter=True lets the already-seen URL through the dupe filter.
        yield Request(url=response.url, callback=self.no_captcha, method="GET", priority=1, meta=response.meta, dont_filter=True)

    def get_ad(self, response):
        temp_dict = AppextItem()
        try:
            # extract()[0] raises IndexError when the captcha box is absent
            Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
            print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
            self.p_token = ''
            yield Request(url=url_, callback=self.get_p_token, method="GET", priority=1, meta=response.meta)
        except IndexError:
            print("Captcha was not found")
            yield Request(url=url_, callback=self.no_captcha, method="GET", priority=1, meta=response.meta)
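You haven't provided working code, so this is only a demonstration of the idea... The logic here is quite simple:
If a captcha is found, the spider goes to get_p_token, which, after generating the token, goes back to the URL you previously requested. If no captcha is found, it proceeds normally.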
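One thing the flow above does not cover is the "other requests keep executing" part of the question. In Scrapy, priority only affects the order in which queued requests leave the scheduler, so it cannot pause requests that are already scheduled or in flight. A rough sketch of one way to hold URLs back during a refresh is below. It is plain Python, not Scrapy API: the TokenGate class, its method names, the example.com URLs and the `?token=` query parameter are all assumptions made for illustration.

```python
from collections import deque


class TokenGate:
    """Hypothetical helper: parks URLs while the token is being refreshed."""

    def __init__(self, token):
        self.token = token
        self.refreshing = False
        self.parked = deque()  # URLs waiting for a fresh token

    def submit(self, url):
        """Return the tokenized URLs that may be requested right now."""
        if self.refreshing:
            self.parked.append(url)  # hold it until the new token arrives
            return []
        return [self._with_token(url)]

    def captcha_seen(self, url):
        """Park the URL that hit the captcha; True means 'start a refresh'."""
        self.parked.append(url)
        if self.refreshing:
            return False  # a refresh is already under way
        self.refreshing = True
        self.token = ''
        return True

    def token_refreshed(self, token):
        """Store the new token and release every parked URL, in order."""
        self.token = token
        self.refreshing = False
        released = [self._with_token(u) for u in self.parked]
        self.parked.clear()
        return released

    def _with_token(self, url):
        # Assumption: the token travels as a query parameter named "token".
        return f"{url}?token={self.token}"


gate = TokenGate("old")
first = gate.submit("https://example.com/ad/1")            # goes out immediately
needs_refresh = gate.captcha_seen("https://example.com/ad/2")
held = gate.submit("https://example.com/ad/3")             # parked, nothing to request
released = gate.token_refreshed("new")                     # both parked URLs come back
```

In a spider, get_ad would call captcha_seen and yield the token request only when it returns True, and get_p_token would yield one Request (with dont_filter=True) per URL returned by token_refreshed. Responses that were already in flight when the captcha appeared cannot be recalled; they would simply hit the captcha check themselves and get parked the same way.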