How to use Amazon Textract with PDF files

Question

Art*_*urS 5 ocr text-extraction amazon-web-services amazon-textract

I can already extract text with Textract, but only from JPEG files. I would like to use it with PDF files.

I have the following code:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# removing the quotation marks from the string, otherwise would cause problems to A.I
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)


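As I said, this works fine. But I would like to pass it a PDF file, the way a web application would, for testing.

I know a PDF can be converted to JPEG in Python, but it would be nice to use the PDF directly. I read the documentation and did not find an answer.

How can I do this?

Edit 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.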

Answer 1

tyr*_*rex 7

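As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing so may be cheaper and easier than you think:

Create a new S3 bucket in the console and note the bucket name, then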

import random
import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path+filename, filename)

client = boto3.client('textract')
response = client.start_document_text_detection(
                   DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename} },
                   ClientRequestToken=str(random.randint(1, 10**10)))  # the token must be a string

jobid = response['JobId']
response = client.get_document_text_detection(JobId=jobid)


It may take 5-50 seconds before the call to get_document_text_detection(...) returns a result. Until then, the response says the job is still in progress.
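Since the job is usually still running right after it is started, the call has to be repeated until it finishes. A minimal polling sketch follows. The helper name wait_for_job and the delay and max_attempts defaults are my own choices, not part of boto3. It relies on the JobStatus field of the response, which reports IN_PROGRESS, SUCCEEDED, FAILED or PARTIAL_SUCCESS.

```python
import time


def wait_for_job(client, job_id, delay=5.0, max_attempts=60):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    `client` is a boto3 Textract client. `delay` (seconds) and
    `max_attempts` are illustrative defaults, not values recommended by AWS.
    """
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id)
        status = response['JobStatus']
        if status == 'SUCCEEDED':
            return response
        if status != 'IN_PROGRESS':
            # FAILED or PARTIAL_SUCCESS: raise instead of polling forever
            raise RuntimeError(f'Textract job {job_id} ended with status {status}')
        time.sleep(delay)
    raise TimeoutError(f'Textract job {job_id} still in progress after {max_attempts} polls')
```

With this in place, the last line of the snippet above would become response = wait_for_job(client, jobid). For production use, Textract can also publish a completion notification to an SNS topic, which avoids polling altogether.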

As I understand it, each token results in one billed API call; if a token has been used before, the result of that earlier call is retrieved instead.
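If that reading is right, a token derived from the file itself would let repeated runs on the same document reuse the earlier job instead of starting a new one. Here is a sketch under that assumption. The helper name request_token_for is hypothetical, and I am assuming the token may be up to 64 characters of letters, digits, hyphens and underscores, which a SHA-256 hex digest satisfies.

```python
import hashlib


def request_token_for(path):
    """Return a deterministic token derived from the file's contents.

    Assumption: ClientRequestToken accepts up to 64 characters from
    [a-zA-Z0-9-_]; a SHA-256 hex digest is exactly 64 hex characters.
    """
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

It would be passed as ClientRequestToken=request_token_for(path + filename) in place of the random number.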

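Edit: I forgot to mention a complication with large documents: in that case the result may have to be stitched together from several "pages" of responses. The kind of code you need to add is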


...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
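Once all the response pages are collected, the text can be pulled out the same way the question does for a single response. A small sketch follows; the helper name lines_from_pages is my own, and it assumes each response carries a Blocks list as in the question's code.

```python
def lines_from_pages(pages):
    """Collect the text of every LINE block across the paginated responses."""
    lines = []
    for page in pages:
        # A response page without blocks is treated as empty
        for block in page.get('Blocks', []):
            if block['BlockType'] == 'LINE':
                lines.append(block['Text'])
    return lines
```

For example, documentText = '\n'.join(lines_from_pages(pages)) gives one detected line per output line.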


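Thank you so much for the edit. I didn't know about NextToken and never ran into it while implementing this... that's what happens when you don't read the whole documentation :'( I had been googling for the past few days trying to work out why Textract wasn't scanning my entire document when called through boto3 :3 (2 upvotes)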

Answer 2

sas*_*ash 3

As stated on the AWS Textract FAQ page, https://aws.amazon.com/textract/faqs/, PDF files are supported, and the SDK supports them as well: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html

Example usage: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/12-pdf-text.py

Asked:	6 years, 2 months ago
Viewed:	7329 times
Last activity:	5 years, 6 months ago