Downloading images with Scrapy

ibl*_*vic 7 python scrapy

I'm just getting started with Scrapy and have run into my first real problem: downloading images. Here is my spider.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from example.items import ProductItem
from scrapy.utils.response import get_base_url

import re

class ProductSpider(CrawlSpider):
    name = "product"
    allowed_domains = ["domain.com"]
    start_urls = [
            "http://www.domain.com/category/supplies/accessories.do"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//td[@class="thumbtext"]')
        number = 0
        for site in sites:
            item = ProductItem()
            xpath = '//div[@class="thumb"]/img/@src'
            item['image_urls'] = site.select(xpath).extract()[number]
            item['image_urls'] = 'http://www.domain.com' + item['image_urls']
            items.append(item)
            number = number + 1
        return items

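When I comment out ITEM_PIPELINES and IMAGES_STORE in settings.py, I get the correct URL for the image I want to download (I checked by pasting it into a browser).

But when I uncomment them, I get the following error: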

raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url:h

So I can't download my images.

I searched all day and found nothing useful.

war*_*iuc 12

I think the image URL you scraped is relative. To build an absolute URL, use urlparse.urljoin:

def parse(self, response):
    ...
    image_relative_url = hxs.select("...").extract()[0]
    import urlparse
    image_absolute_url = urlparse.urljoin(response.url, image_relative_url.strip())
    item['image_urls'] = [image_absolute_url]
    ...
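The snippet above uses the Python 2 urlparse module. As a minimal sketch of the same idea on Python 3, where urljoin lives in urllib.parse, the following shows how relative image paths resolve against the page URL. Only the page URL comes from the question; the image paths are made up for illustration.

```python
from urllib.parse import urljoin

# Page URL from the question; the image paths below are hypothetical.
page_url = "http://www.domain.com/category/supplies/accessories.do"

# A root-relative path replaces the whole path of the page URL.
print(urljoin(page_url, "/images/thumb/1.jpg"))
# A path without a leading slash resolves against the page's directory.
print(urljoin(page_url, "thumb/1.jpg"))
# An already absolute URL is returned unchanged.
print(urljoin(page_url, "http://cdn.domain.com/1.jpg"))
```

Because urljoin handles all three cases, it is usually safer than concatenating 'http://www.domain.com' with the scraped value by hand.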

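I haven't used ITEM_PIPELINES, but the docs say:

In a Spider, you scrape an item and put the URLs of its images into an image_urls field.

So item['image_urls'] should be a list of image URLs. But your code has: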

item['image_urls'] = 'http://www.domain.com' + item['image_urls']

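So I guess it iterates over your single URL character by character, treating each character as a URL. That would explain why the error message ends with just "h", the first character of "http://...".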
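A quick way to see this is to loop over a string the way a pipeline would loop over a list of URLs. This is only an illustration in plain Python 3, not Scrapy's actual code: has_scheme is a hypothetical helper that mimics the kind of check behind the "Missing scheme" error, and the image path is invented.

```python
from urllib.parse import urlparse

def has_scheme(url):
    """Hypothetical stand-in for the scheme check; not Scrapy's implementation."""
    return bool(urlparse(url).scheme)

as_string = "http://www.domain.com/a.jpg"  # what the question's code stores
as_list = [as_string]                      # what image_urls is expected to hold

# Iterating a string yields single characters, starting with "h".
first_from_string = next(iter(as_string))
# Iterating a list yields whole URLs.
first_from_list = next(iter(as_list))

print(first_from_string, has_scheme(first_from_string))
print(first_from_list, has_scheme(first_from_list))
```

The single character "h" has no scheme, while the full URL taken from the list does.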


ddn*_*ddn 6

I think you may need to give the item its image URL in a list:

item['image_urls'] = [ 'http://www.domain.com' + item['image_urls'] ]
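Putting both answers together, one defensive option is to normalize whatever was extracted into a list of absolute URLs before assigning it to the item. A minimal sketch on Python 3, assuming the value is either a single string or a list of strings; normalize_image_urls is a hypothetical helper, not part of Scrapy, and the image paths are invented.

```python
from urllib.parse import urljoin

def normalize_image_urls(value, base_url):
    """Return a list of absolute URLs from a single string or a list of strings."""
    # Wrap a lone string so it is not iterated character by character.
    candidates = [value] if isinstance(value, str) else list(value)
    # Strip stray whitespace and resolve each entry against the page URL.
    return [urljoin(base_url, src.strip()) for src in candidates]

base = "http://www.domain.com/category/supplies/accessories.do"
print(normalize_image_urls("/img/1.jpg", base))
print(normalize_image_urls(["/img/1.jpg", " /img/2.jpg "], base))
```

The returned list can be assigned directly to item['image_urls'].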