How to avoid re-downloading media to S3 in Scrapy?

Kur*_*eek 2 python amazon-s3 scrapy

I asked a similar question before (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not get a definitive answer, I am asking again.

I have downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "redownloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.
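For reference, enabling this pipeline against S3 comes down to a few project settings; a minimal sketch (the bucket and credentials are placeholders):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 's3://my-bucket/downloads/'  # placeholder bucket
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'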

Looking at the implementation of the FilesPipeline class (https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py), this appears to come from the FILES_EXPIRES setting, which defaults to 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading
    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.
    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.
    `uptodate` files are the ones that the pipeline processed and are still
        valid files.
    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download

            checksum = result.get('checksum', None)
            return {'url': request.url, 'path': path, 'checksum': checksum}

        path = self.file_path(request, info=info)
        # stat the stored file; _onsuccess decides from its age whether
        # a fresh download is needed
        dfd = defer.maybeDeferred(self.store.stat_file, path, info)
        dfd.addCallbacks(_onsuccess, lambda _: None)
        return dfd

Am I understanding this correctly? Also, I do not see a similar age_days check in the S3FilesStore class; is the age check also enforced for files stored on S3? (I also could not find any tests that exercise this age-check feature for S3.)

pau*_*rth 5

FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file may be before it gets downloaded (again).
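So if the 90-day default does not suit you, set it explicitly in your project settings; a one-line sketch (180 is just an example value):

# settings.py
FILES_EXPIRES = 180  # days; a stored file younger than this is treated as up to date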

The key part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and for your question it looks specifically for the "last_modified" information. If the last-modified time is older than "expires days", the download is triggered.
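Stripped of the Twisted plumbing, the check boils down to this standalone sketch (mirroring the snippet quoted in the question; is_expired is my name for it, not Scrapy's):

import time

def is_expired(last_modified_epoch, expires_days):
    # Re-download when the stored copy is older than FILES_EXPIRES days.
    age_days = (time.time() - last_modified_epoch) / 60 / 60 / 24
    return age_days > expires_days

# A file stored 100 days ago, checked against the default of 90 days:
print(is_expired(time.time() - 100 * 24 * 60 * 60, 90))  # True -> re-download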

You can look at how the S3 store fetches the "last modified" information. It depends on whether botocore is available or not.
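When botocore is available, that lookup is essentially a HEAD request on the stored key, whose LastModified and ETag end up as the 'last_modified' and 'checksum' values that _onsuccess inspects. A standalone boto3 sketch (bucket and key are hypothetical):

import time
import boto3

s3 = boto3.client('s3')
head = s3.head_object(Bucket='my-bucket', Key='downloads/full/abc123.pdf')  # placeholders
stat = {
    'checksum': head['ETag'].strip('"'),  # ETag arrives wrapped in quotes
    'last_modified': time.mktime(head['LastModified'].timetuple()),  # datetime -> epoch seconds
}
print(stat)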