The project maintains an S3 bucket which holds a large zip file (about 1.5 GB) containing .xpt and .sas7bdat files; unzipped, the contents are roughly 20 GB.
I am trying to unzip the file and push the same folder structure back to S3.
The following code works for small zip files but fails for the large one (1.5 GB):
import io
import zipfile
import boto3
from urllib.parse import urlparse

client = boto3.client('s3')
bucket = boto3.resource('s3').Bucket('my-zip-bucket')
putObjects = []

for obj in bucket.objects.all():
    # file_name = os.path.abspath(obj.key)  # get full path of files
    key = urlparse(obj.key.encode('utf8'))
    obj = client.get_object(Bucket='my-zip-bucket', Key=obj.key)
    # Read the whole object into memory before unzipping
    with io.BytesIO(obj["Body"].read()) as tf:
        tf.seek(0)  # rewind the file
        with zipfile.ZipFile(tf, mode='r') as zipf:
            for file in zipf.infolist():
                fileName = file.filename
                putFile = client.put_object(Bucket='my-un-zip-bucket-', Key=fileName, Body=zipf.read(file))
                putObjects.append(putFile)
Error (the function hits its memory limit): Memory Size: 3008 MB, Max Memory Used: 3008 MB
I would like to validate whether this can be made to work for the large zip file within Lambda's memory limits, or whether a different serverless approach is needed.
There is a serverless solution using AWS Glue! (It nearly killed me to figure this out.)
The solution has two parts: a Lambda function, triggered by the S3 upload, that starts a Glue job, and a Glue job script that does the actual unzipping.
See the code below, which unzips the ZIP file and puts the contents back into the same bucket (the destination is configurable).
If this helps, please upvote :)
Lambda script (Python 3) that starts a Glue job named YourGlueJob:
import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and (URL-decoded) key of the uploaded zip from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)
    try:
        # Hand the heavy lifting off to the Glue job
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
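For the Lambda to receive the event, the zip bucket needs an ObjectCreated notification pointing at the function. A minimal sketch using boto3 follows; the function ARN, account ID, and the suffix filter are assumptions, and the Lambda also needs a resource-based permission allowing s3.amazonaws.com to invoke it:

import boto3

s3 = boto3.client('s3')

# Subscribe the Lambda to new .zip objects in the source bucket.
# The function ARN below is a placeholder for your own Lambda.
s3.put_bucket_notification_configuration(
    Bucket='my-zip-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:UnzipTrigger',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'suffix', 'Value': '.zip'}
                        ]
                    }
                }
            }
        ]
    }
)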
AWS Glue job script that unzips the file:
import sys
import io
import zipfile

import boto3

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args['bucket']
key = args['key']

# Read the zip object into memory (Glue workers have far more memory than Lambda)
obj = s3r.Object(bucket_name=bucket, key=key)
buffer = io.BytesIO(obj.get()['Body'].read())

z = zipfile.ZipFile(buffer)
names = z.namelist()
for filerr in names:
    print(filerr)
    y = z.open(filerr)
    # Write each member back to S3, prefixed with the original zip key
    arcname = key + filerr
    x = io.BytesIO(y.read())
    s3.upload_fileobj(x, bucket, arcname)
    y.close()
print(names)

job.commit()
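The Lambda assumes a Glue job named YourGlueJob already exists. One way to create it is sketched below, assuming the script above has been uploaded to S3 and an IAM role with S3 and Glue permissions exists; the role name, script location, and worker sizing are placeholders:

import boto3

glue = boto3.client('glue')

# Create the Glue Spark job that the Lambda starts with start_job_run.
# Role and ScriptLocation are placeholders for your own resources.
glue.create_job(
    Name='YourGlueJob',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-glue-scripts/unzip_job.py',
        'PythonVersion': '3',
    },
    # The script reads --bucket and --key; empty defaults are stubs that
    # the Lambda overrides per run via the Arguments it passes.
    DefaultArguments={
        '--bucket': '',
        '--key': '',
    },
    GlueVersion='2.0',
    WorkerType='G.1X',
    NumberOfWorkers=2,
)

The Arguments passed by the Lambda (--bucket, --key) override these defaults for each job run.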