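For my Scrapy project, I am currently using the ImagesPipeline. The downloaded images are stored with the SHA1 hash of their URLs as the file names.
How can I store the files using my own custom file names?
What if my custom file name needs to contain another scraped field from the same item? For example, use item['desc'] as the file name for the image from item['image_url']. If I understand correctly, that would involve somehow accessing the other item fields from the image pipeline.
Any help would be appreciated.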
sum*_*mid 16
This is just an implementation of the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated:
# Scrapy 0.24 module paths; from Scrapy 1.0 on use scrapy.pipelines.images
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):

    # Name of the downloaded (full-size) version
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # gives access to every item field, not just the URL
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % image_guid

    # Name of the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        # Use request.url here: response defaults to None
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        for image in item['images']:
            # Attach the item so file_path() can read its other fields
            yield Request(image, meta={'item': item})
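One thing to watch with url.split('/')[-1]: if the image URL carries a query string or fragment, that text ends up in the file name. A minimal sketch of a stricter alternative, using only the standard library, is below. The helper name basename_from_url is made up for illustration and is not part of Scrapy.

```python
import posixpath
from urllib.parse import unquote, urlparse


def basename_from_url(url):
    """Return the last path segment of a URL, without query string or fragment.

    Hypothetical helper, not a Scrapy API.
    """
    # urlparse() separates the path from '?query' and '#fragment'
    path = urlparse(url).path
    # Decode percent-escapes such as %20, then keep the final path segment
    return posixpath.basename(unquote(path))


url = "http://example.com/img/cat.jpg?size=large"
print(url.split('/')[-1])       # cat.jpg?size=large
print(basename_from_url(url))   # cat.jpg
```

Inside file_path() this could stand in for the split, e.g. image_guid = basename_from_url(request.url).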
小智 12
In Scrapy 0.12 I solved it like this:
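# Scrapy 0.12 module paths
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):

    # Name of the downloaded (full-size) version
    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return 'full/%s.jpg' % image_guid

    # Name of the thumbnail version
    def thumb_key(self, url, thumb_id):
        image_guid = thumb_id + url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # item['images'] is assumed to hold a single image URL here
        yield Request(item['images'])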
This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory. The file name of the downloaded image is the key argument.
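import os

# FSImagesStore, ImagesPipeline, NotConfigured and settings are the Scrapy 0.10 names


class FSImagesStoreChangeableDirectory(FSImagesStore):

    def persist_image(self, key, image, buf, info, append_path):
        # key is the relative file name; append_path is prepended as a directory
        absolute_path = self._get_filesystem_path(append_path + '/' + key)
        self._mkdir(os.path.dirname(absolute_path), info)
        image.save(absolute_path)


class ProjectPipeline(ImagesPipeline):

    def __init__(self):
        # Skips ImagesPipeline.__init__ so the custom store below is used
        super(ImagesPipeline, self).__init__()
        store_uri = settings.IMAGES_STORE
        if not store_uri:
            raise NotConfigured
        self.store = FSImagesStoreChangeableDirectory(store_uri)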
I found my own way in 2017, with Scrapy 1.1.3:
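from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):  # class name is illustrative

    def file_path(self, request, response=None, info=None):
        # Read back the name stored by get_media_requests()
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)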
As in the code above, you can put any name you want into the Request meta in get_media_requests(), and read it back in file_path() via request.meta.get('yourname', '').
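A scraped field such as item['name'] may contain spaces, slashes or other characters that are awkward in a file path, and it may be empty. Below is a minimal sketch of a helper that file_path() could call to turn the value from request.meta into a usable relative path. The function name safe_image_path, the "full" folder, the ".jpg" fallback extension and the "unnamed" fallback name are all assumptions made for illustration, not Scrapy behaviour.

```python
import os
import re
from urllib.parse import urlparse


def safe_image_path(name, url, folder="full", default_ext=".jpg"):
    """Build a relative storage path from an item field and the image URL.

    Hypothetical helper, not a Scrapy API.
    """
    # Keep the extension of the original image, ignoring any query string
    ext = os.path.splitext(urlparse(url).path)[1].lower() or default_ext
    # Collapse every run of characters other than word characters, '.' and '-'
    slug = re.sub(r"[^\w.-]+", "_", name).strip("._")
    # Fall back to a fixed name when the field is empty or all punctuation
    return "%s/%s%s" % (folder, slug or "unnamed", ext)


print(safe_image_path("Red shoes / size 42", "http://example.com/p/123.PNG?x=1"))
# full/Red_shoes_size_42.png
```

In file_path() this might be used as return safe_image_path(request.meta.get('filename', ''), request.url). Two items with the same name would still map to the same path, so a unique field such as an ID may need to be added to the name.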