Ale*_*ung 5 pdf-generation amazon-web-services aws-lambda
Tried:
Possibilities:
- Convert the document to PDF in browser JS when I fetch the doc file from S3.
- Somehow fix comtypes or win32com in the deployment package.
I am using Python 3.6.
import json
import logging
import urllib.parse
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.exceptions import ClientError
import lxml
import comtypes.client  # Windows-only: needs COM and an installed copy of Microsoft Word
import io
import os
import sys
import threading
from docx import Document

s3 = boto3.client('s3')
wdFormatPDF = 17  # Word's WdSaveFormat value for PDF


def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # Create the Document from the downloaded bytes
        f = io.BytesIO(response['Body'].read())
        document = Document(f)
        # Code for formatting my document object is hidden in this section.
        document.save('/tmp/' + key)
        pdfkey = key.split(".")[0] + ".pdf"
        # The following function is supposed to convert my doc to PDF
        doctopdf('/tmp/' + key, '/tmp/' + pdfkey)
        # The PDF file is then saved to S3
        s3.upload_file('/tmp/' + pdfkey, 'output', pdfkey)
    except Exception as e:
        logging.error(e)
        raise e


def doctopdf(in_file, out_file):
    word = comtypes.client.CreateObject('Word.Application')
    doc = word.Documents.Open(in_file)
    doc.SaveAs(out_file, FileFormat=wdFormatPDF)
    doc.Close()
    word.Quit()
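I also ran into the problem of converting Word documents (doc/docx) to PDF or another document type. I solved it by running LibreOffice as a subprocess in AWS Lambda with Python 3.8 (it also works with Python 3.6 and 3.7).
Basically, this setup picks up a file from S3 based on the input event, converts it to PDF, and puts the converted file in the same S3 location. Let's walk through the setup.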
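As a minimal sketch of the "same S3 location" rule described above, the output key can be derived from the input key by swapping only the extension. The helper name `pdf_key_for` is hypothetical and not part of the original answer.

```python
import posixpath


def pdf_key_for(document_key: str) -> str:
    """Return the S3 key for the converted PDF, next to the source document.

    Hypothetical helper: S3 keys always use "/" as the separator, so
    posixpath is used instead of os.path.
    """
    prefix, base_name = posixpath.split(document_key)
    stem, _extension = posixpath.splitext(base_name)
    # posixpath.join drops an empty prefix, so a top-level key gets no leading "/"
    return posixpath.join(prefix, f"{stem}.pdf")
```

For example, `pdf_key_for("dir/file.docx")` gives `dir/file.pdf`, and a top-level key such as `file.docx` maps to `file.pdf`.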
For this setup, we need a LibreOffice executable that Lambda can access. To achieve this, we will use a Lambda layer. Now, you have two options:
Time to create the Lambda (deployment package).
Create fonts/fonts.conf with the following content (assuming LibreOffice will be extracted under the /tmp/instdir directory):
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <dir>/tmp/instdir/share/fonts/truetype</dir>
  <cachedir>/tmp/fonts-cache/</cachedir>
  <config></config>
</fontconfig>
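The answer does not show how fontconfig is told about this file. One possible approach, sketched here as an assumption, is to point the `FONTCONFIG_PATH` environment variable at the packaged `fonts` directory, either in the Lambda configuration or when launching the subprocess. The helper name `libreoffice_env` is hypothetical, and setting `HOME` to `/tmp` is likewise an assumption based on `/tmp` being the writable directory in Lambda.

```python
import os


def libreoffice_env(package_root=None):
    """Build an environment dict for the soffice subprocess (hypothetical helper).

    package_root is the directory containing fonts/fonts.conf. Lambda exposes
    the deployment package location in LAMBDA_TASK_ROOT (normally /var/task).
    """
    root = package_root or os.environ.get("LAMBDA_TASK_ROOT", "/var/task")
    env = dict(os.environ)
    # fontconfig looks for fonts.conf in the directory named by FONTCONFIG_PATH
    env["FONTCONFIG_PATH"] = f"{root}/fonts"
    # Assumption: LibreOffice needs a writable home directory for its profile
    env["HOME"] = "/tmp"
    return env
```

The result could then be passed as `env=libreoffice_env()` to `subprocess.run` when invoking soffice.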
In the lambda_function.py file:
import os
import subprocess
import tarfile
from io import BytesIO

import boto3
import brotli

libre_office_install_dir = '/tmp/instdir'


def load_libre_office():
    """Extract LibreOffice from the layer into /tmp (once per container) and return the soffice path."""
    if os.path.exists(libre_office_install_dir) and os.path.isdir(libre_office_install_dir):
        print('We have a cached copy of LibreOffice, skipping extraction')
    else:
        print('No cached copy of LibreOffice exists, extracting tar stream from Brotli file.')
        buffer = BytesIO()
        with open('/opt/lo.tar.br', 'rb') as brotli_file:
            decompressor = brotli.Decompressor()
            # Decompress in 1 KiB chunks; a short read marks the end of the file
            while True:
                chunk = brotli_file.read(1024)
                buffer.write(decompressor.decompress(chunk))
                if len(chunk) < 1024:
                    break
        buffer.seek(0)
        print('Extracting tar stream to /tmp for caching.')
        with tarfile.open(fileobj=buffer) as tar:
            tar.extractall('/tmp')
        print('Done caching LibreOffice!')
    return f'{libre_office_install_dir}/program/soffice.bin'


def download_from_s3(bucket, key, download_path):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, download_path)


def upload_to_s3(file_path, bucket, key):
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket, key)


def convert_word_to_pdf(soffice_path, word_file_path, output_dir):
    # Pass the arguments as a list so that paths containing spaces are not split apart
    conv_cmd = [
        soffice_path, "--headless", "--norestore", "--invisible", "--nodefault",
        "--nofirststartwizard", "--nolockcheck", "--nologo",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", output_dir, word_file_path,
    ]
    response = subprocess.run(conv_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if response.returncode != 0:
        # Retry once if the first attempt fails
        response = subprocess.run(conv_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if response.returncode != 0:
            return False
    return True


def lambda_handler(event, context):
    bucket = event["document_bucket"]
    key = event["document_key"]
    key_prefix, base_name = os.path.split(key)
    download_path = f"/tmp/{base_name}"
    output_dir = "/tmp"
    download_from_s3(bucket, key, download_path)
    soffice_path = load_libre_office()
    is_converted = convert_word_to_pdf(soffice_path, download_path, output_dir)
    if is_converted:
        file_name, _ = os.path.splitext(base_name)
        # Avoid a leading "/" in the output key when the input key has no prefix
        pdf_key = f"{key_prefix}/{file_name}.pdf" if key_prefix else f"{file_name}.pdf"
        upload_to_s3(f"{output_dir}/{file_name}.pdf", bucket, pdf_key)
        return {"response": "file converted to PDF and available at same S3 location of input key"}
    else:
        return {"response": "cannot convert this document to PDF"}
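Build the brotlipy dependency in a Linux environment (and copy site-packages/brotli from that Linux environment after installation), because the target Lambda runtime is Amazon Linux. In the end, the directory structure of your Lambda (deployment package) should look like this: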
.
+-- brotli/*
+-- fonts
| +-- fonts.conf
+-- lambda_function.py
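The layout above can be zipped for upload in several ways. The following is a minimal sketch using only the standard library, assuming the files have already been assembled in one local directory. The function name `build_package` and the paths in the usage note are hypothetical.

```python
import os
import zipfile


def build_package(source_dir, zip_path):
    """Zip source_dir so that its contents sit at the root of the archive.

    Hypothetical helper: zip_path should live outside source_dir so the
    archive does not try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                # Store paths relative to source_dir, e.g. "fonts/fonts.conf"
                archive.write(full_path, os.path.relpath(full_path, source_dir))
```

A call such as `build_package("package", "function.zip")` would produce an archive with `lambda_function.py` at its top level, which is where Lambda looks for the handler module.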
If your file's S3 URI is s3://my-bucket-name/dir/file.docx, you can invoke this Lambda handler with the following input event:
{
    "document_bucket": "my-bucket-name",
    "document_key": "dir/file.docx"
}
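The question's handler was triggered by an S3 notification rather than a custom event. If that trigger is preferred, a small adapter could map each notification record to the `document_bucket` / `document_key` shape used here. This is a sketch under that assumption, and the function name `to_conversion_events` is hypothetical.

```python
from urllib.parse import unquote_plus


def to_conversion_events(s3_event):
    """Map S3 notification records to the custom event shape (hypothetical helper).

    Object keys in S3 notifications are URL-encoded, so they are decoded
    before being passed on.
    """
    return [
        {
            "document_bucket": record["s3"]["bucket"]["name"],
            "document_key": unquote_plus(record["s3"]["object"]["key"]),
        }
        for record in s3_event.get("Records", [])
    ]
```

Because the PDF is written back to the same bucket, the notification would need a suffix filter (for example `.docx`) so that the uploaded PDF does not trigger the function again.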
Cheers! If you run into any issues, let me know and I'll be happy to help :)