Searching for a specific file in an AWS S3 bucket using Python

san*_*mar 3 python amazon-s3 boto botocore boto3

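I have access to AWS S3, and the bucket contains close to 300 files. I need to download a single file from this bucket by pattern matching or searching, because I don't know the exact file name (say the file name ends in .csv).
Here is my sample code, which lists all the files in the bucket: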

import os
# ET is assumed to be the standard-library ElementTree module
import xml.etree.ElementTree as ET

from boto.exception import S3ResponseError
from boto.s3.connection import S3Connection


def s3connection(credentialsdict):
    """
    List every key in the billing bucket.

    :param credentialsdict: dict with the keys "access_key", "secret_key"
        and "bucket_name" (the bucket that holds the billing CSV files)
    """
    os.environ['S3_USE_SIGV4'] = 'True'
    conn = S3Connection(credentialsdict["access_key"], credentialsdict["secret_key"], host='s3.amazonaws.com')
    billing_bucket = conn.get_bucket(credentialsdict["bucket_name"], validate=False)
    try:
        billing_bucket.get_location()
    except S3ResponseError as e:
        if e.status == 400 and e.error_code == 'AuthorizationHeaderMalformed':
            conn.auth_region_name = ET.fromstring(e.body).find('./Region').text
    billing_bucket = conn.get_bucket(credentialsdict["bucket_name"])
    print(billing_bucket)

    if not billing_bucket:
        raise Exception("Please Enter valid bucket name. Bucket %s does not exist"
                        % credentialsdict.get("bucket_name"))
    for key in billing_bucket.list():
        print(key.name)
    del os.environ['S3_USE_SIGV4']

Can I pass a search string to retrieve the exactly matching file name?

moj*_*eto 5

You can use JMESPath expressions to search and filter S3 files. To do that, you need to get the S3 paginator for list_objects_v2:

import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket="your_bucket_name")
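
If the keys share a known prefix, the listing can also be narrowed on the server side before any client-side filtering. The following is only a minimal sketch: the helper name `iter_keys` is hypothetical, and the client is passed in by the caller so that anything behaving like `boto3.client('s3')` works. `Prefix` is a real parameter of `list_objects_v2`; pages without a `Contents` entry are skipped, because S3 leaves that key out when a page has no objects.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every object key in `bucket` that starts with `prefix`.

    `client` is expected to behave like boto3.client("s3").
    `iter_keys` is a hypothetical helper name, not part of boto3.
    """
    paginator = client.get_paginator("list_objects_v2")
    # A single list_objects_v2 call returns at most 1000 objects; the
    # paginator follows the continuation tokens for us.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

It could be called as `for key in iter_keys(boto3.client('s3'), "your_bucket_name", "billing/"): print(key)`, where the `billing/` prefix is just an illustrative value.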

Now that you have the iterator, you can run a JMESPath search on it. The most useful function is contains, which works like a %like% query:

objects = page_iterator.search("Contents[?contains(Key, `partial-file-name`)][]")

But in your case (finding all files that end in .csv), it is better to use ends_with, which works like a *.csv query:

objects = page_iterator.search("Contents[?ends_with(Key, `.csv`)][]")
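
As far as I know, the standard JMESPath functions cover substring, prefix and suffix tests (contains, starts_with, ends_with) but not glob or regular-expression matching. For a more general pattern, one option is to filter the keys in plain Python instead. The sketch below uses the standard-library `fnmatch` module; `filter_keys` and the sample keys are made up for illustration.

```python
from fnmatch import fnmatchcase


def filter_keys(keys, pattern):
    """Return the keys that match a shell-style pattern such as '*.csv'.

    `filter_keys` is a hypothetical helper name used for this sketch.
    """
    # fnmatchcase compares case-sensitively on every platform, which
    # matches how S3 treats object keys.
    return [key for key in keys if fnmatchcase(key, pattern)]


# Illustrative keys only; in practice these would come from the paginator.
sample_keys = ["reports/2020-04.csv", "reports/readme.txt", "billing.CSV"]
print(filter_keys(sample_keys, "*.csv"))  # ['reports/2020-04.csv']
```

Note that `*` in an `fnmatch` pattern also matches `/`, so `*.csv` finds keys at any "folder" depth.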

You can then get the object keys like this:

for item in objects:
    print(item['Key'])
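
Since the question is about downloading a single file, the listing step can be combined with `download_file`, which is a real method on the boto3 S3 client. This is only a sketch under a few assumptions: the helper name `download_single_match` is hypothetical, the suffix test is done in Python rather than in JMESPath, and exactly one key is expected to match.

```python
import os


def download_single_match(client, bucket, suffix, dest_dir="."):
    """Download the only object whose key ends with `suffix`.

    `client` is expected to behave like boto3.client("s3").
    Raises ValueError when zero or several keys match, since the caller
    expects exactly one file. The helper name is hypothetical.
    """
    paginator = client.get_paginator("list_objects_v2")
    matches = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket)
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(suffix)
    ]
    if len(matches) != 1:
        raise ValueError("expected 1 match for %r, found %d: %r"
                         % (suffix, len(matches), matches))
    key = matches[0]
    # Keep only the last path component as the local file name.
    local_path = os.path.join(dest_dir, os.path.basename(key))
    client.download_file(bucket, key, local_path)
    return local_path
```

For example, `download_single_match(boto3.client('s3'), "your_bucket_name", ".csv")` would save the one matching CSV file into the current directory, or raise if the bucket holds none or more than one.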

This answer is based on https://blog.jeffbryner.com/2020/04/21/jupyter-pandas-analysis.html/sf/answers/1909249821/