Scrapy: ValueError('Missing scheme in request url: %s' % self._url)

Question

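I am trying to scrape data from a web page. The page is just a bulleted list of 2500 URLs. Scrapy should fetch the list, visit each URL, and extract some data from it.

Here is my code: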

from bs4 import BeautifulSoup
from scrapy import Request, Selector
from scrapy.spiders import CrawlSpider

# NewsFields is the asker's own Item class; its definition is not shown here.


class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Elements that carry the page metadata in their "content" attribute.
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            # Collapse the text of the content panel into a single line.
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        # Follow every link in the bulleted list.
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

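The problem is that the code above scrapes only about 215 of the 2500 articles. It then shuts down with this error:

ValueError('Missing scheme in request url: %s' % self._url)

I have no idea what is causing this error.

Any help is much appreciated.

Thanks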

Answer 1

miz*_*gun 8

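Update 01/2019

Nowadays Scrapy's Response instances have a very convenient method, response.follow, which builds a Request from the given URL (absolute or relative, or even a Link object produced by a LinkExtractor), using response.url as the base: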

yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)

Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow

The code below looks like the problem:

 for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
     yield Request(url, callback=self.parse)

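If any of the URLs is not fully qualified, e.g. it looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you will get this error. To make sure you yield valid requests, you can use urljoin: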

    yield Request(response.urljoin(url), callback=self.parse)
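
To see why the join helps, here is a small standalone sketch using the standard library's urljoin, which resolves links in roughly the same way as response.urljoin does against the response's URL. The href values are invented for illustration; only the listing URL comes from the question.

```python
from urllib.parse import urljoin, urlparse

# URL of the page the links were extracted from (taken from the question).
page_url = "http://www.example.org/uat-area/scrapy/all-news-listing"

# Hypothetical hrefs: root-relative, document-relative, and already absolute.
hrefs = [
    "/news/article-1",
    "article-2",
    "http://www.example.org/news/article-3",
]

for href in hrefs:
    # A relative href has an empty scheme, which is what Request rejects.
    print(repr(urlparse(href).scheme), "->", urljoin(page_url, href))

resolved = [urljoin(page_url, href) for href in hrefs]
```

Absolute URLs pass through the join unchanged, so it is safe to apply to every extracted link.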

The idiomatic Scrapy way, though, is to use a LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html
