Requesting a URL or URLs manually from parse() in Python Scrapy

Question

Its*_*thn 2 python request scrapy python-2.7 python-requests

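I have a simple script that scrapes data from Amazon. As everyone knows, Amazon sometimes serves a captcha, and when it does the page title is 'Robot Check'. I wrote logic for that case: if the page title is 'Robot Check', print a message saying the page was not scraped because it contains a captcha, and do not extract any data from it. Otherwise, the script continues.

In the if branch I tried yield scrapy.Request(response.url, callback=self.parse) to re-request the current URL, but it did not work. All I need is to re-request response.url and carry on with the script. I think I have to remove response.url from wherever Scrapy records it, so that Scrapy does not remember the URL as already crawled. In other words, I have to fool Scrapy into requesting the same URL again. Alternatively, is there a way to mark response.url as a failed URL so that it is re-requested automatically?

Here is the simple script. start_urls lives in a separate file named urls in the same folder, so I import it from there.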

import scrapy
import re
from urls import start_urls

class AmazondataSpider(scrapy.Spider):
    name = 'amazondata'
    allowed_domains = ['www.amazon.co.uk']  # domain names only, without the URL scheme
    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        try:
            if 'Robot Check' == str(response.xpath('//title/text()').extract_first().encode('utf-8')):
                print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
                print 'URL : ',response.url,'\n\n'
                yield scrapy.Request(response.url, callback=self.parse)
            else:
                print '\n\nThere is a data in this page no robot check or captcha\n\n'
                pgtitle = response.xpath('//title/text()').extract_first().encode('utf-8')
                print '\n\n\nhello', pgtitle,'\n\n\n'
                if pgtitle == 'Robot check':
                    # Logic for extracting data from the response via XPath goes here
                    pass
        except Exception as e:
            print '\n\n\n\n',e,'\n\n\n\n\n'


Answer 1

Uma*_*air 5

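Tell Scrapy not to filter out the duplicate request. By default, Scrapy will not request a URL again once it has already requested it and received a response (here, an HTTP 200 carrying the Robot Check page), so the second request for response.url is dropped as a duplicate.

Pass dont_filter=True to the request.

In your case: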
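
To see why the re-request disappears, here is a toy model of a scheduler with a seen-request filter. It is only a sketch of the idea, not Scrapy's actual code: ToyScheduler and enqueue are hypothetical names, the URL is a placeholder, and the filter is keyed on the raw URL for simplicity, whereas Scrapy fingerprints the whole request.

```python
class ToyScheduler(object):
    """Simplified model of a scheduler with a duplicate filter (not Scrapy's real code)."""

    def __init__(self):
        self.seen = set()   # URLs that have already passed through the filter
        self.queue = []     # requests that will actually be downloaded

    def enqueue(self, url, dont_filter=False):
        """Queue the URL and return True, or return False if it is dropped as a duplicate."""
        if not dont_filter:
            if url in self.seen:
                return False        # silently dropped, which is what the question observes
            self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = 'https://www.amazon.co.uk/some-product'       # placeholder URL for illustration
first = scheduler.enqueue(url)                      # True: first time this URL is seen
second = scheduler.enqueue(url)                     # False: dropped as a duplicate
forced = scheduler.enqueue(url, dont_filter=True)   # True: the filter is bypassed
```

The second call models the plain yield scrapy.Request(response.url, callback=self.parse) from the question, and the third models the same request with dont_filter=True.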

print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
print 'URL : ',response.url,'\n\n'
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)
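
One caveat with dont_filter=True is that the spider will loop forever if the site keeps answering with the captcha page. A possible way to bound this, offered as a sketch rather than a definitive implementation, is to count attempts in the request's meta. The names is_robot_check, build_retry, MAX_CAPTCHA_RETRIES and the captcha_retries meta key are all made up for this example; only response.meta, response.request and Request.replace() are part of Scrapy.

```python
MAX_CAPTCHA_RETRIES = 3  # hypothetical cap, tune to taste


def is_robot_check(title):
    """Return True if the page title looks like Amazon's captcha page."""
    return title is not None and title.strip().lower() == 'robot check'


def build_retry(response, max_retries=MAX_CAPTCHA_RETRIES):
    """Return a copy of the original request that bypasses the duplicate
    filter, or None once the retry cap has been reached."""
    retries = response.meta.get('captcha_retries', 0)  # hypothetical meta key
    if retries >= max_retries:
        return None
    meta = dict(response.meta)              # copy, so the original meta is untouched
    meta['captcha_retries'] = retries + 1
    # Request.replace() keeps the URL, callback and headers of the original request.
    return response.request.replace(dont_filter=True, meta=meta)
```

Inside parse(), this would be used roughly as: read the title with response.xpath('//title/text()').extract_first(), and if is_robot_check(title) is true, call build_retry(response) and yield the result when it is not None. When it returns None the URL has hit the cap and can be logged as failed instead of being requested again.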


Archived:	8 years, 5 months ago
Views:	625
Last activity:	8 years, 5 months ago