Retrieving subfolder names in an S3 bucket with boto3

mar*_*tin 52 python amazon-s3 amazon-web-services boto3

Using boto3, I can access my AWS S3 bucket:

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

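Now, the bucket contains a folder first-level, which itself contains several subfolders named with a timestamp, for instance 1456753904534. I need to know the names of these subfolders for another job I'm doing, and I wonder whether I could have boto3 retrieve them for me.

So I tried: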

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

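which gives a dictionary whose 'Contents' key holds all the third-level files instead of the second-level timestamp directories. In fact I get a list containing entries such as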

{u'ETag': '"etag"', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
u'Size': size, u'StorageClass': 'storageclass'}

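You can see that a specific file, in this case part-00014, is retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of all the paths, but retrieving everything at the third level just to get the second level is ugly and expensive!

I also tried something reported here: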

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

小智 74

The piece of code below returns ONLY the 'subfolders' in a 'folder' of the S3 bucket.

import boto3
bucket = 'my-bucket'
# Make sure the prefix ends with /
prefix = 'prefix-name-with-slash/'

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
# 'CommonPrefixes' is missing from the response when there are no subfolders
for o in result.get('CommonPrefixes', []):
    print('sub folder : ', o.get('Prefix'))

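For more details, see https://github.com/boto/boto3/issues/134

  • What if there are more than 1000 different prefixes? (16 upvotes)
  • What if I want to list the contents of a specific subfolder? (7 upvotes)
  • @azhar22k, I assume you could just run the function recursively for each 'subfolder'. (2 upvotes)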
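
On the question of more than 1000 prefixes: a single list_objects call returns at most 1000 entries, so one option is to let a paginator request the remaining pages. The following is a minimal sketch and not part of the answer above. The function name list_subfolders is hypothetical, and client is assumed to be a boto3 S3 client such as boto3.client('s3').

```python
def list_subfolders(client, bucket, prefix=""):
    """Return every 'subfolder' (common prefix) directly under prefix.

    client is assumed to be a boto3 S3 client; the name list_subfolders
    is illustrative only.
    """
    paginator = client.get_paginator("list_objects_v2")
    subfolders = []
    # Each page holds at most 1000 entries; the paginator keeps requesting
    # pages until the listing is exhausted.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # 'CommonPrefixes' is absent from a page that has none.
        for entry in page.get("CommonPrefixes", []):
            subfolders.append(entry["Prefix"])
    return subfolders
```

A call would look something like list_subfolders(boto3.client('s3'), 'my-bucket', 'first-level/'), with the prefix ending in '/'.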

azh*_*22k 28

It took me a lot of time to figure out, but finally here is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

import boto3

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
files_not_found = True
for obj in bucket.objects.filter(Prefix=prefix):
    print('{0}:{1}'.format(bucket.name, obj.key))
    files_not_found = False
if files_not_found:
    # use bucket.name: formatting the Bucket object itself prints its repr
    print("ALERT", "No file in {0}/{1}".format(bucket.name, prefix))

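  • In my opinion, this is an extremely inefficient solution. S3 is built to handle an arbitrary delimiter in keys, for example '/'. That lets you skip over 'folders' full of objects without having to paginate over them. And even if you insist on a full listing (i.e. the 'recursive' equivalent in the aws cli), you must use a paginator, or you will list only the first 1000 objects. (3 upvotes)
  • What if your folder contains a very large number of objects? (2 upvotes)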

Pie*_*e D 21

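Short answer:

  • Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].

  • Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing the whole listing in memory. Instead, consider your 'lister' as an iterator, and handle the stream it produces.

  • Use boto3.client, not boto3.resource. The resource version doesn't seem to handle the Delimiter option well. If you have a resource, say a bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client.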
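
To illustrate the second and third points, here is a small sketch of a listing treated as a stream. It is an illustration under assumptions, not the iterator given in the long answer: iter_keys and first_keys are hypothetical names, and client is assumed to be a boto3 S3 client (for instance bucket.meta.client).

```python
from itertools import islice


def iter_keys(client, bucket, prefix=""):
    """Lazily yield object keys under prefix, one page at a time."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page without objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def first_keys(client, bucket, prefix, n):
    """Return at most n keys.

    islice stops pulling from the generator after n keys, so pages after
    the one holding the n-th key are not requested.
    """
    return list(islice(iter_keys(client, bucket, prefix), n))
```

Because nothing is accumulated inside iter_keys, memory use stays bounded by one page regardless of how many objects sit under the prefix.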

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

import os
import boto3
from collections import namedtuple
from operator import attrgetter


S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then objects are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        >>> bucket = s3.Bucket(name)

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        # an empty path means the bucket root; do not turn it into '/'
        if len(path) > 0 and not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

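Test:

The following is helpful for testing the behavior of the paginator and list_objects. It creates a number of dirs and files. Since the pages hold up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus one object under each dir, of course; S3 stores only objects).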

import concurrent.futures
import os
def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

With a little doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts:

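  • The Marker is really exclusive. Giving Marker=topdir + 'mixed/0500_foo_a' makes the listing start after that key (as per the AmazonS3 API), i.e., with .../mixed/0500_foo_b. That's the reason for __prev_str().

  • Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It's pretty good at not building enormous responses.

  • By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).

  • Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.

  • Adding a fix to your code: if someone wants to list all the directories in a bucket non-recursively, they would call `s3list(bucket, '', recursive=False, list_objs=False)`, so I added `and len(path) > 0` to `if not path.endswith('/')`. (2 upvotes)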
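
The first observation, that Marker is exclusive, can be reproduced without touching S3. The sketch below is a local model only: list_after is a hypothetical stand-in for a listing with Marker, prev_str mirrors the __prev_str helper above, and it assumes Python's code-point ordering agrees with S3's ordering for the keys involved.

```python
import bisect


def list_after(sorted_keys, marker):
    """Local model of Marker: keep only keys strictly greater than marker."""
    return sorted_keys[bisect.bisect_right(sorted_keys, marker):]


def prev_str(s):
    """Return a string that sorts just before s (same idea as __prev_str)."""
    if len(s) == 0:
        return s
    head, last = s[:-1], ord(s[-1])
    if last > 0:
        head += chr(last - 1)
    return head + "\u7FFF" * 10


keys = ["mixed/0499_foo_b", "mixed/0500_foo_a", "mixed/0500_foo_b"]
# Using the key itself as the marker skips that key.
exclusive = list_after(keys, "mixed/0500_foo_a")
# Using a marker just below the key keeps it in the listing.
inclusive = list_after(keys, prev_str("mixed/0500_foo_a"))
```

Here exclusive starts at mixed/0500_foo_b while inclusive starts at mixed/0500_foo_a, which is why s3list passes __prev_str(start) rather than start.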

moo*_*oot 20

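S3 is an object store; it doesn't have a real directory structure. The '/' is rather cosmetic. One reason people want a directory structure is that they can maintain/prune/add a tree to the application. For S3, you treat such a structure as a sort of index or search tag.

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects: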

import boto3 
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name') 

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

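A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.

  • This gives me the same result as my attempt in the question. I guess I'll have to do it the hard way, by grabbing all the keys from the returned objects and splitting the strings to get the folder names. (3 upvotes)
  • @martina: a lazy Python split that picks up the last item in the list, e.g. filename = keyname.split("/")[-1] (2 upvotes)
  • @martin `directory_name = os.path.dirname('directory/path/and/filename.txt')` and `file_name = os.path.basename('directory/path/and/filename.txt')` (2 upvotes)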
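
Since keys are plain strings, the grouping that a listing with Prefix and Delimiter performs can be modelled on a local list of keys. This is a sketch for intuition only, not how S3 is implemented, and group_by_delimiter is a hypothetical name. It also shows what the key-splitting approach from the comments has to do once every key has been downloaded.

```python
def group_by_delimiter(keys, prefix="", delimiter="/"):
    """Split the keys under prefix into (contents, common_prefixes).

    A local model of a listing made with Prefix and Delimiter.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past a delimiter: report only the "folder".
            folder = prefix + head + delimiter
            # Keys are sorted, so equal folders arrive back to back.
            if not common_prefixes or common_prefixes[-1] != folder:
                common_prefixes.append(folder)
        else:
            contents.append(key)
    return contents, common_prefixes
```

With the keys from the question, group_by_delimiter(keys, 'first-level/') would put the timestamp "directories" in common_prefixes and leave contents for objects stored directly under first-level/.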

CpI*_*ILL 14

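An important realisation about S3 is that there are no folders/directories, just keys. The apparent folder structure is simply prepended to the filename to become the 'Key', so to list the contents of some/path/to/the/file/ in myBucket you can try: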

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")
# 'Contents' is absent when no key matches the prefix
for obj in response.get('Contents', []):
    print(obj['Key'])

which would give you something like:

some/path/to/the/file/yo.jpg
some/path/to/the/file/meAndYou.gif
...


小智 13

I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters.

import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    print(obj['Key'])

The output of the code above would display the following:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation

In order to strip out only the directory name for secondLevelFolder, I just used the Python method split():

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    # keys are already str in Python 3, so no encode() call is needed
    directory_name = obj['Key'].split('/')
    print(directory_name[1])

The output of the code above would display the following:

secondLevelFolder
secondLevelFolder

Python split() Documentation

If you'd like to get the directory name AND the contents item name, then replace the print line with the following:

print("{}/{}".format(directory_name[1], directory_name[2]))

And the following would be output:

secondLevelFolder/item1
secondLevelFolder/item2

Hope this helps.
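
The loop above prints the folder name once per object. If each name is wanted only once, the keys can be reduced afterwards. The following is a small sketch with a hypothetical helper, unique_folder_names, operating on keys that have already been retrieved; it still pays the cost of listing every object, which the Delimiter-based answers avoid.

```python
def unique_folder_names(keys, depth=1, delimiter="/"):
    """Return the distinct path components found at depth, in first-seen order."""
    names = {}  # a dict keeps insertion order and drops duplicates
    for key in keys:
        parts = key.split(delimiter)
        # Require something after the component, so that a plain object
        # sitting at this depth is not mistaken for a folder.
        if len(parts) > depth + 1:
            names[parts[depth]] = None
    return list(names)
```

For example, it could be fed with [obj['Key'] for obj in theobjects['Contents']] from the snippet above.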


小智 8

The AWS CLI does this (presumably without fetching and iterating through all the keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way using boto3.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

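It looks like they do indeed use Prefix and Delimiter. I was able to write a function that gets all the directories at the root level of a bucket by slightly modifying that code: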

import boto3


def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders


cem*_*cem 6

The following works for me... S3 objects:

s3://bucket/
    form1/
       section11/
          file111
          file112
       section12/
          file121
    form2/
       section21/
          file211
          file112
       section22/
          file221
          file222
          ...
      ...
   ...

Using:

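from boto3.session import Session

session = Session()  # uses the default credentials and region
s3client = session.client('s3')
bucket = 'bucket'  # the bucket from the listing above
resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']]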

we get:

form1/
form2/
...

With:

resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']] 

we get:

form1/section11/
form1/section12/
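
The two calls above (forms first, then sections) can be folded into one recursive walk, as one of the earlier comments suggests. This is a sketch under assumptions rather than part of the answer: walk_prefixes is a hypothetical name, client is assumed to be a boto3 S3 client, and a paginator is used so that levels with more than 1000 prefixes are covered.

```python
def walk_prefixes(client, bucket, prefix="", max_depth=2):
    """Yield each 'folder' prefix under prefix, down to max_depth levels."""
    if max_depth <= 0:
        return
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            child = entry["Prefix"]
            yield child
            # Descend one level: the child prefix already ends with '/'.
            yield from walk_prefixes(client, bucket, child, max_depth - 1)
```

Each level costs at least one request per folder, so max_depth is worth keeping small on buckets with many prefixes.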


Acu*_*nus 6

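Why not use the s3path package, which makes it as convenient as working with pathlib? If you must use boto3, however:

Using boto3.resource

This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.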

import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)
_S3_RESOURCE = boto3.resource("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(_S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using boto3.client

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.

import logging
from typing import cast, List

import boto3

log = logging.getLogger(__name__)
_S3_CLIENT = boto3.client("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    contents: List[dict] = []
    continuation_token = None
    if limit <= 0:
        return contents
    while True:
        max_keys = min(1000, limit - len(contents))
        request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
        if continuation_token:
            log.info(  # type: ignore
                "Listing %s objects in s3://%s/%s using continuation token ending with %s with %s objects listed thus far.",
                max_keys, bucket_name, prefix, continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
            response = _S3_CLIENT.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
        else:
            log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.", max_keys, bucket_name, prefix, len(contents))
            response = _S3_CLIENT.list_objects_v2(**request_kwargs)
        assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
        contents.extend(response.get("Contents", []))  # "Contents" is absent when nothing matches
        is_truncated = response["IsTruncated"]
        if (not is_truncated) or (len(contents) >= limit):
            break
        continuation_token = response["NextContinuationToken"]
    assert len(contents) <= limit
    log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
    return contents


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)