Scrapy: detecting whether an XPath result does not exist

Eli*_*elo · 1 · Tags: xpath, web-crawler, scrapy, web-scraping, python-2.7

I've been trying to build my first crawler, and I've got what I need (shipping info and price for the 1st and 2nd shops), but with 2 crawlers instead of 1, because I hit a big blocker here.

When there is more than 1 shop, the output is:

In [1]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()').extract()
Out[1]: 
[u'ENV\xcdO 3,95\u20ac ',
 u'ENV\xcdO GRATIS',
 u'ENV\xcdO GRATIS',
 u'ENV\xcdO 4,95\u20ac ']

To get only the second result, I'm using:

In [2]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
Out[2]: u'ENV\xcdO GRATIS'

But when there is no second result (only 1 shop), I get:

IndexError: list index out of range

and the crawler skips the whole page, even though the other fields have data...
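The failure can be reproduced in plain Python, independent of Scrapy: indexing past the end of the extracted list raises `IndexError`, and an unhandled exception in a spider callback aborts that page's item. A minimal sketch, assuming a page where only one shipping text was extracted:

```python
# Only one shop on the page, so the extracted list has a single element:
shipping_texts = [u'ENV\xcdO GRATIS']

try:
    second = shipping_texts[1]  # same effect as selector[1].extract() with one shop
except IndexError as e:
    # Without this guard, the exception propagates out of the callback
    # and the whole item for the page is lost.
    second = None
    print('IndexError:', e)
```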

After several attempts, I settled on a quick workaround to get the results: 2 crawlers, one for the first shop and another for the second. But now I'd like to do it cleanly with just 1 crawler.

Any help, hints, or suggestions would be appreciated. This is my first attempt at building a recursive crawler with Scrapy, and I'm rather enjoying it.

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from Guapalia.items import GuapaliaItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GuapaliaSpider(CrawlSpider):
    name = "guapalia"
    allowed_domains = ["guapalia.com"]
    start_urls = (
        'https://www.guapalia.com/perfumes?page=1',
        'https://www.guapalia.com/maquillaje?page=1',
        'https://www.guapalia.com/cosmetica?page=1',
        'https://www.guapalia.com/linea-de-bano?page=1',
        'https://www.guapalia.com/parafarmacia?page=1',
        'https://www.guapalia.com/solares?page=1',
        'https://www.guapalia.com/regalos?page=1',
    )
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='js-pager']/a[contains(text(),'Siguientes')]"),follow=True),
        Rule(LinkExtractor(restrict_xpaths="//div[@class='list-display__item list-display__item--product']/div/a[@class='col-xs-10 col-sm-10 col-md-12 clickOnProduct']"),callback='parse_articles',follow=True),
    )

    def parse_articles(self, response):
        item = GuapaliaItem()
        articles_urls = response.url
        articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
        articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
        articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')[1].extract()
        articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
        articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
        item['articles_urls'] = articles_urls
        item['articles_first_shop'] = articles_first_shop
        item['articles_first_shipping'] = articles_first_shipping
        item['articles_second_shop'] = articles_second_shop if articles_second_shop else 'N/A'
        item['articles_second_shipping'] = articles_second_shipping
        item['articles_name'] = articles_name
        yield item

Basic output of the crawler, with the correct format, when there is more than 1 shop:

2017-09-21 09:53:11 [scrapy] DEBUG: Crawled (200) <GET https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355> (referer: https://www.guapalia.com/perfumes?page=1)
2017-09-21 09:53:11 [scrapy] DEBUG: Scraped from <200 https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355>
{'articles_first_shipping': [u'ENV\xcdO GRATIS'],
 'articles_first_shop': [u'DOUGLAS'],
 'articles_name': [u'ZEN edp vaporizador 100 ml'],
 'articles_second_shipping': u'ENV\xcdO 3,99\u20ac ',
 'articles_second_shop': u'BUYSVIP',
 'articles_urls': 'https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355'}

The problem is when a second shop doesn't exist: because of how my code indexes the second-shop fields, it throws

IndexError: list index out of range

SOLUTION, thanks to @Tarun Lalwani:

def parse_articles(self, response):
    item = GuapaliaItem()
    articles_urls = response.url
    articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
    articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
    articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')
    articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')
    articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
    if len(articles_second_shop) > 1:
        item['articles_second_shop'] = articles_second_shop[1].extract()
    else:
        item['articles_second_shop'] = 'Not Found'
    if len(articles_second_shipping) > 1:
        item['articles_second_shipping'] = articles_second_shipping[1].extract()
    else:
        item['articles_second_shipping'] = 'Not Found'
    item['articles_urls'] = articles_urls
    item['articles_first_shop'] = articles_first_shop
    item['articles_first_shipping'] = articles_first_shipping
    item['articles_name'] = articles_name
    yield item
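A more compact variant of the same length check (my own suggestion, not part of the original answer) uses list slicing, which never raises `IndexError`: slicing past the end simply yields an empty list, so a default can be supplied. The helper name `nth_or_default` is hypothetical:

```python
def nth_or_default(values, index, default='Not Found'):
    """Return values[index] if it exists, otherwise default."""
    sliced = values[index:index + 1]  # empty list when index is out of range
    return sliced[0] if sliced else default

# Works on plain lists, and on the output of selector.extract():
print(nth_or_default([u'DOUGLAS', u'BUYSVIP'], 1))  # BUYSVIP
print(nth_or_default([u'DOUGLAS'], 1))              # Not Found
```

With Scrapy selectors you would call `.extract()` first and pass the resulting list to the helper.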

Tar*_*ani 5

You need to capture the result in a variable first. Then you can decide based on its length:

texts = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')

if len(texts) > 1:
    data = texts[1].extract()
elif len(texts) == 1:
    data = texts[0].extract()
else:
    data = "Not found"