I want to download the images of all the products from here. My spider looks like this:
from shopclues.items import ImgData
import scrapy

class multipleImages(scrapy.Spider):
    name = 'multipleImages'
    start_urls = ['http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera',]

    def parse(self, response):
        for url in response.css('div.products-grid div.grid-product'):
            yield {
                ImgData(image_urls=[url.css('img::attr(src)').extract()])
            }
And items.py:
import scrapy
from scrapy.item import Item

class ShopcluesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class ImgData(Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
But I get the following errors when I run the spider:
2016-09-29 11:56:19 [scrapy] DEBUG: Crawled (200) <GET http://www.shopclues.com/robots.txt> (referer: None)
2016-09-29 11:56:20 [scrapy] DEBUG: Crawled (200) <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera> (referer: None)
2016-09-29 11:56:20 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera>
2016-09-29 11:56:20 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera>
2016-09-29 11:56:20 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera>
2016-09-29 11:56:20 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera>
2016-09-29 11:56:20 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET http://www.shopclues.com/electronic-accessories-8/cameras-18/cameras-special.html?search=1&q1=camera>
What does this error mean? What could be causing it?
Pass the list of URLs to the pipeline.
def parse(self, response):
    images = ImgData()
    images['image_urls'] = []
    for url in response.css('div.products-grid div.grid-product'):
        images['image_urls'].append(url.css('img::attr(src)').extract_first())
    yield images
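For the yielded URL list to actually be downloaded, the images pipeline also has to be enabled in the project settings. A minimal sketch of settings.py, assuming Scrapy's built-in ImagesPipeline is used (it needs Pillow installed); the storage directory name is a hypothetical placeholder:

```python
# settings.py (sketch; the storage directory below is an assumed placeholder)

# Enable Scrapy's built-in images pipeline; the number sets its order among pipelines.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Directory where the downloaded images are stored.
IMAGES_STORE = 'downloaded_images'
```

By default this pipeline reads URLs from the item's image_urls field and writes the download results to its images field, which matches the two fields declared on ImgData.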
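{} is Python's notation for defining either a set or a dictionary, depending on what you put inside the braces. A list of values, {a, b, c, d}, is a set; key-value pairs, {a: b, c: d}, make a dictionary.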
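The difference can be checked in a plain Python session. This is a small sketch, with a string used as a hypothetical stand-in for the ImgData item:

```python
# Hypothetical stand-in for a Scrapy item; any hashable value behaves the same way.
item = "ImgData(...)"

as_set = {item}               # a bare value inside braces builds a set
as_dict = {"images": item}    # a key: value pair builds a dict
empty = {}                    # empty braces build a dict, not a set

print(type(as_set).__name__)   # set
print(type(as_dict).__name__)  # dict
print(type(empty).__name__)    # dict
```

This is why Scrapy reports got 'set': the braces around the ImgData(...) call contain a single value with no key.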
You are yielding a set on this line:
yield {
    ImgData(image_urls=[url.css('img::attr(src)').extract()])
}
I assume you wanted to yield a dictionary?
yield {
    'images': ImgData(image_urls=[url.css('img::attr(src)').extract()]),
}
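One more detail that may matter once the set error is gone: extract() already returns a list of strings, so wrapping it in square brackets gives a list that contains a list. As far as I know, the images pipeline expects image_urls to be a flat list of URL strings. A sketch with plain lists, where the URL is a made-up example value:

```python
# Stand-in for what url.css('img::attr(src)').extract() returns: a list of strings.
# The URL is a made-up example value.
extracted = ["http://example.com/a.jpg"]

nested = [extracted]      # what image_urls=[...extract()] produces
flat = list(extracted)    # the extracted list passed through as-is

print(nested)  # [['http://example.com/a.jpg']]
print(flat)    # ['http://example.com/a.jpg']
```

Under that assumption, passing image_urls=url.css('img::attr(src)').extract() without the extra brackets would give the flat form.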