AWS Lambda (Python) fails to unzip and store files in S3

K.P*_*Pil 5 python amazon-s3 amazon-web-services aws-lambda

The project maintains an S3 bucket holding a large (1.5 GB) zip file containing .xpt and .sas7bdat files. Unzipped, the contents total 20 GB.

I am trying to unzip the file and push the same folder structure back to S3.

The following code works for small zip files but fails for the large (1.5 GB) one:

import io
import zipfile

import boto3

s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('my-zip-bucket')

putObjects = []
for obj in bucket.objects.all():
    obj = client.get_object(Bucket='my-zip-bucket', Key=obj.key)

    # Buffer the entire archive in memory; for the 1.5 GB zip this alone
    # consumes half of the 3008 MB Lambda allocation.
    with io.BytesIO(obj["Body"].read()) as tf:
        tf.seek(0)  # rewind the buffer

        with zipfile.ZipFile(tf, mode='r') as zipf:
            for file in zipf.infolist():
                fileName = file.filename
                # zipf.read() decompresses the whole member into memory as well
                putFile = client.put_object(Bucket='my-un-zip-bucket-', Key=fileName, Body=zipf.read(file))
                putObjects.append(putFile)

Error (from the Lambda invocation report): Memory Size: 3008 MB, Max Memory Used: 3008 MB

I would like to validate:

  1. Is AWS Lambda simply not a suitable solution for files this large?
  2. Should I use a different library or approach, rather than reading everything into memory? (A sketch of what I mean follows this list.)
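
For question 2, here is a minimal sketch of the kind of alternative I have in mind, assuming the bucket names above and a placeholder archive key. It still buffers the 1.5 GB archive, but it streams each member to S3 via zipf.open() instead of decompressing it wholesale with zipf.read(). I have not tested this at full scale:

import io
import zipfile

import boto3

s3 = boto3.client('s3')

# 'my-archive.zip' is a placeholder key; the 1.5 GB archive is still
# buffered here, but nothing larger than that ever is.
obj = s3.get_object(Bucket='my-zip-bucket', Key='my-archive.zip')
buffer = io.BytesIO(obj['Body'].read())

with zipfile.ZipFile(buffer, mode='r') as zipf:
    for name in zipf.namelist():
        # zipf.open() returns a streaming reader, so the 20 GB of
        # decompressed data never has to fit in memory at once.
        with zipf.open(name) as member:
            s3.upload_fileobj(member, 'my-un-zip-bucket-', name)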

Gan*_*rfz 18

There is a serverless solution using AWS Glue! (I nearly died figuring this out.)

The solution has two parts:

  1. A Lambda function, triggered by S3 when the ZIP file is uploaded, that creates a GlueJobRun, passing the S3 object key as an argument to Glue (see the trigger sketch just below).
  2. A Glue job that unzips the file (in memory!) and uploads the results back to S3.
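
For reference, wiring up the S3 trigger in part 1 might look like the following sketch. The function ARN and account ID are placeholders, and S3 must separately be granted permission to invoke the function (e.g. via the Lambda add_permission API or the console):

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda for every new .zip object in the upload bucket.
# The function ARN below is a placeholder.
s3.put_bucket_notification_configuration(
    Bucket='my-zip-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:StartUnzipGlueJob',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.zip'}]}},
        }]
    },
)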

See the code below, which unzips the ZIP file and puts the contents back into the same bucket (configurable).

If this helps, please upvote :)

Lambda script (Python 3) that invokes the Glue job named YourGlueJob:

import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and (URL-decoded) object key out of the S3 event record
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)
    try:
        # Start the Glue job, forwarding the object location as job arguments
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
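
For completeness, a rough sketch of the one-time setup registering YourGlueJob itself; the role name and script location are placeholders (the role needs read/write access to the buckets):

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='YourGlueJob',
    Role='YourGlueServiceRole',  # placeholder IAM role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-scripts-bucket/unzip_job.py',  # placeholder
    },
    DefaultArguments={
        '--bucket': '',
        '--key': '',
    },
)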

AWS Glue job script that unzips the files:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

import io
import zipfile

import boto3

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args["bucket"]
key = args["key"]

obj = s3r.Object(
    bucket_name=bucket,
    key=key
)

# Buffer the whole archive in memory -- acceptable here, since a Glue worker
# has far more memory headroom than Lambda's 3008 MB cap.
buffer = io.BytesIO(obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)

names = z.namelist()  # renamed from 'list' to avoid shadowing the built-in
for name in names:
    print(name)
    # Prefix each member with the archive key so output objects from
    # different archives do not collide in the bucket.
    arcname = key + name
    with z.open(name) as member:
        s3.upload_fileobj(member, bucket, arcname)
print(names)

job.commit()