Kur*_*eek 9 python amazon-s3 amazon-web-services boto3
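I have a large number of files (> 1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3.
However, I noticed that, according to http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class lists at most 1,000 objects: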
In [1]: import boto3
In [2]: client = boto3.client('s3')
In [11]: apks = client.list_objects(Bucket='iper-apks')
In [16]: type(apks['Contents'])
Out[16]: list
In [17]: len(apks['Contents'])
Out[17]: 1000
However, I would like to list all the objects, even if there are more than 1,000. How can I achieve this?
Joh*_*ter 14
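As kurt-peek notes, boto3 has a Paginator class, which lets you iterate over pages of S3 objects and can easily be used to provide an iterator over the items within those pages: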
import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        # An empty page has no 'Contents' key, so check KeyCount first
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)
This will output something like the following:
{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
u'Size': 242,
u'StorageClass': 'STANDARD'}
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
u'Key': '2017-06-01-10-28-58-732EB022229AACF7',
u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
u'Size': 238,
u'StorageClass': 'STANDARD'}
...
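Note that list_objects_v2 is recommended over list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response as ContinuationToken in the next call, for as long as IsTruncated is true in the response.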
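The lower-level approach could look roughly like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the function name iterate_bucket_items_manually and its optional client parameter are hypothetical and were added here so the loop can be exercised without AWS credentials. The only S3 call it makes is list_objects_v2(), and it relies on the Contents, IsTruncated and NextContinuationToken response fields.

```python
def iterate_bucket_items_manually(bucket, client=None):
    """
    Yield every object in a bucket by following continuation tokens by hand.

    `client` is a hypothetical convenience parameter (not part of the original
    answer); when omitted, a real boto3 S3 client is created.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in instead
        client = boto3.client('s3')

    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)

        # 'Contents' is absent when a page holds no keys
        for item in response.get('Contents', []):
            yield item

        # Stop once S3 reports that nothing was cut off
        if not response.get('IsTruncated'):
            break

        # Ask for the next page, starting where this one ended
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

Calling it mirrors the paginator version, e.g. for i in iterate_bucket_items_manually('my_bucket'): print(i). In most cases the paginator is the simpler choice, since it performs this token bookkeeping for you.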