Scraping image data with Scrapy

Pic*_*Man 9 python xpath scrapy

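I am using Scrapy to scrape product-related images from amazon.com. How can I parse the image data?

I usually use XPath, but I cannot find an XPath for the images (other than the thumbnails). For example, this is how I parse the title: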

title = response.xpath('//h1[@id="title"]/span/text()').extract()

The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1

Tom*_*art 7

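It looks like the images can be extracted from JavaScript that is present in the page source. I used the js2xml library to convert the JavaScript source code into XML (you can read more about it in Scrapinghub's blog post). The XML can then be used to create a Selector, from which the data can be extracted as usual. Take a look at this example spider: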

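# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        item = dict()
        # Find the inline script that registers the "ImageBlockATF" image data.
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        if js is None:
            # Nothing to parse if the script is missing (e.g. the request was blocked).
            return
        # Convert the JavaScript source into an XML tree.
        xml = js2xml.parse(js)
        # Wrap the tree in a Selector so it can be queried with XPath as usual.
        selector = scrapy.Selector(root=xml)
        # Collect the high-resolution image URLs from the colorImages data.
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item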
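If js2xml is not available, a rougher fallback is to run a regular expression over the script text. This is a swapped-in technique, not the method used in this answer, and it rests on the assumption that the URLs appear in the script as `"hiRes":"https://..."` pairs. The sample script, the example.com URLs, and the helper name `extract_hires_urls` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical stand-in for the text of the "ImageBlockATF" script.
# The real script on the page may be laid out differently.
SAMPLE_JS = '''
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"hiRes":"https://example.com/images/I/a.jpg","thumb":"https://example.com/images/I/a_thumb.jpg"},
            {"hiRes":null,"thumb":"https://example.com/images/I/b_thumb.jpg"},
            {"hiRes":"https://example.com/images/I/c.jpg","thumb":"https://example.com/images/I/c_thumb.jpg"}
        ]}
    };
    return data;
});
'''

# Matches "hiRes":"<url>" and captures the URL; unquoted null values do not match.
HIRES_RE = re.compile(r'"hiRes"\s*:\s*"([^"]+)"')


def extract_hires_urls(js_source):
    """Return every quoted hiRes URL found in the JavaScript source, in order."""
    return HIRES_RE.findall(js_source)


if __name__ == "__main__":
    for url in extract_hires_urls(SAMPLE_JS):
        print(url)
```

Inside a spider, the string returned by `extract_first()` could be passed to `extract_hires_urls` in place of `SAMPLE_JS`. A regular expression is more brittle than parsing the JavaScript, so it is best treated as a quick check rather than a replacement for the js2xml approach.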

To test the spider, run it with a browser-like user agent:

scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"

The custom user agent is needed because Amazon appears to block Scrapy based on its User-Agent string.
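When the spider lives inside a Scrapy project rather than in a standalone file, the same user agent can be set once in the project's settings.py instead of being passed with -s on every run. A minimal sketch, assuming a project created with `scrapy startproject`; USER_AGENT is a standard Scrapy setting, and the value is the one from the command above.

```python
# settings.py (assumed to be the settings module of a Scrapy project)

# Browser-like user agent sent with every request made by the project's spiders.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"
)
```

With this in place the spider would be started with `scrapy crawl example` from the project directory, and the -s option can be dropped.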