Concatenating small files on Amazon S3

jia*_* fu 11 concatenation amazon-s3 amazon-web-services

Is there a way to concatenate files smaller than 5MB on Amazon S3? Because the files are so small, multipart upload doesn't work.

Pulling down all of these files and concatenating them locally is not an efficient solution.

So, can anyone point me to an API that can do this?

Joh*_*ein 10

Amazon S3 does not offer a concatenate function. It is primarily an object storage service.

You will need some process that downloads the objects, combines them, and then uploads them again. The most efficient way is to download the objects in parallel, to take full advantage of available bandwidth, but that is more complex to code.
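
As an illustration, here is a minimal sketch of such a download-merge-upload process using boto3 with a thread pool for the parallel downloads. The bucket name and keys are hypothetical placeholders.

import concurrent.futures
import io

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'                                     # hypothetical bucket
keys = ['small_0.json', 'small_1.json', 'small_2.json']  # hypothetical keys

def fetch(key):
    # download one object fully into memory
    return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

# download the objects in parallel; map() preserves the order of `keys`
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(fetch, keys))

# concatenate in memory and upload the result as a single object
s3.upload_fileobj(io.BytesIO(b''.join(chunks)), bucket, 'merged.json')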

I would recommend doing the processing "in the cloud" to avoid having to download the objects over the Internet. Doing it on Amazon EC2 or AWS Lambda would be more efficient and lower cost.

  • Old comment, but this is not entirely correct. You can keep a 5MB garbage object on S3 and concatenate against it, where part 1 = the 5MB garbage object and part 2 = the file you want to concatenate. Keep repeating this for each fragment, and at the end use a range copy to strip out the 5MB of garbage. (7 upvotes)
  • @wwadge Oh! That's sneaky, and *very* cool! Use [UploadPartCopy](http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html) to copy data from multiple files as though they were parts of the same file. Neat! (2 upvotes)

Kar*_*nka 8

Based on @wwadge's comment, I wrote a Python script.

It works around the 5MB limit by uploading a dummy object slightly larger than 5MB, then appending each small file to it as if it were the last part, and finally stripping the dummy bytes out of the merged file with a byte-range copy.

import boto3
import os

bucket_name = 'multipart-bucket'
merged_key = 'merged.json'
mini_file_0 = 'base_0.json'
mini_file_1 = 'base_1.json'
dummy_file = 'dummy_file'

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# we need to have a garbage/dummy file with size > 5MB
# so we create and upload this
# this key will also be the key of final merged file
with open(dummy_file, 'wb') as f:
    # slightly > 5MB
    f.seek(1024 * 5200) 
    f.write(b'0')

with open(dummy_file, 'rb') as f:
    s3_client.upload_fileobj(f, bucket_name, merged_key)

os.remove(dummy_file)


# get the number of bytes of the garbage/dummy-file
# needed to strip out these garbage/dummy bytes from the final merged file
bytes_garbage = s3_resource.Object(bucket_name, merged_key).content_length

# for each small file you want to concat
# when this loop has finished, merged.json will contain
# (merged.json + base_0.json + base_1.json)
for key_mini_file in [mini_file_0, mini_file_1]: # include more files if you want

    # initiate multipart upload with merged.json object as target
    mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)
        
    part_responses = []
    # perform multipart copy where merged.json is the first part 
    # and the small file is the second part
    for n, copy_key in enumerate([merged_key, key_mini_file]):
        part_number = n + 1
        copy_response = s3_client.upload_part_copy(
            Bucket=bucket_name,
            CopySource={'Bucket': bucket_name, 'Key': copy_key},
            Key=merged_key,
            PartNumber=part_number,
            UploadId=mpu['UploadId']
        )

        part_responses.append(
            {'ETag':copy_response['CopyPartResult']['ETag'], 'PartNumber':part_number}
        )

    # complete the multipart upload
    # content of merged will now be merged.json + mini file
    response = s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=merged_key,
        MultipartUpload={'Parts': part_responses},
        UploadId=mpu['UploadId']
    )

# get the number of bytes from the final merged file
bytes_merged = s3_resource.Object(bucket_name, merged_key).content_length

# initiate a new multipart upload
mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)            
# do a single copy from the merged file specifying byte range where the 
# dummy/garbage bytes are excluded
response = s3_client.upload_part_copy(
    Bucket=bucket_name,
    CopySource={'Bucket': bucket_name, 'Key': merged_key},
    Key=merged_key,
    PartNumber=1,
    UploadId=mpu['UploadId'],
    CopySourceRange='bytes={}-{}'.format(bytes_garbage, bytes_merged-1)
)
# complete the multipart upload
# after this step merged.json will contain (base_0.json + base_1.json)
response = s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=merged_key,
    MultipartUpload={'Parts': [
       {'ETag':response['CopyPartResult']['ETag'], 'PartNumber':1}
    ]},
    UploadId=mpu['UploadId']
)

If you already have an object larger than 5MB that you just want to append smaller parts to, skip creating the dummy file and the final byte-range copy. Also, I don't know how this performs on a very large number of tiny files; in that case it is probably better to download each file, merge them locally, and then upload, as sketched below.
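
A minimal sketch of that local-merge alternative, assuming the small objects sit under a hypothetical prefix in the same bucket:

import boto3

s3 = boto3.client('s3')
bucket_name = 'multipart-bucket'
prefix = 'small-files/'  # hypothetical prefix holding the small objects

# stream every object under the prefix into one local file, in listing order
paginator = s3.get_paginator('list_objects_v2')
with open('merged_local.json', 'wb') as out:
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get('Contents', []):
            body = s3.get_object(Bucket=bucket_name, Key=obj['Key'])['Body']
            out.write(body.read())

# upload the locally merged file back to S3
s3.upload_file('merged_local.json', bucket_name, 'merged_local.json')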