Kur*_*eek 2 python amazon-s3 scrapy
I asked a similar question before (How does Scrapy avoid redownloading media that was downloaded recently?), but since I didn't get a definitive answer, I'll ask it again.

I have downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "redownloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, this appears to come from the FILES_EXPIRES setting, whose default is 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
    valid files.

    `expired` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Am I understanding this correctly? Also, I don't see a similar age_days boolean check in the S3FilesStore class; is the age check also enforced for files stored on S3? (I also couldn't find any tests covering this age-check feature for S3.)
FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it gets downloaded (again).
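For example, to stretch the window from the default 90 days to 180, you can set it in your project's settings.py (a minimal sketch; the bucket URI and the value 180 are just illustrative):

# settings.py (illustrative values)
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 's3://my-bucket/files/'  # hypothetical bucket
FILES_EXPIRES = 180  # treat files older than 180 days as expired and re-download them

Note that the setting name is resolved through _key_for_pipe, so a FilesPipeline subclass can also be given its own prefixed value (e.g. MYPIPELINE_FILES_EXPIRES for a subclass named MyPipeline).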
The key part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and, for your question, it specifically looks for the "last_modified" information. If the last-modified time is older than "expires days", the download is triggered.
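To make the arithmetic concrete, here is the same check run with hypothetical numbers: a file whose stat says it was last modified 100 days ago, checked against the 90-day default:

import time

expires = 90  # the FILES_EXPIRES default
last_modified = time.time() - 100 * 24 * 60 * 60  # pretend stat_file reported this

age_days = (time.time() - last_modified) / 60 / 60 / 24  # ~100.0
print(age_days > expires)  # True -> _onsuccess returns None -> the file is downloaded again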
You can check how the S3 store gets the "last modified" information. It depends on whether botocore is available.
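As a rough sketch of what such a stat call boils down to (this is not Scrapy's exact code, and the bucket/key names are hypothetical), the last-modified timestamp can be fetched from S3 with a HEAD request through boto3:

import boto3

def stat_s3_file(bucket, key):
    """Return the kind of dict that _onsuccess expects from stat_file."""
    s3 = boto3.client('s3')
    response = s3.head_object(Bucket=bucket, Key=key)
    # 'LastModified' is a timezone-aware datetime; the pipeline compares epoch seconds
    return {
        'last_modified': response['LastModified'].timestamp(),
        'checksum': response['ETag'].strip('"'),
    }

stat = stat_s3_file('my-bucket', 'files/full/abc123.pdf')  # hypothetical bucket/key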