Unsupported document format when using Amazon Textract

Question

Jun*_*apa 10 python python-3.x amazon-textract

I am using Amazon Textract with boto3. When I try to parse a PDF file stored in Amazon S3, I get the error "Request has unsupported document format". I am new to this, and the Textract documentation says that PDF files are supported.

Here is the code I am using.

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This fails with the error "Request has unsupported document format".

Answer 1

小智 17

detect_document_text() is a synchronous API and only supports PNG or JPG images.

If you want to process PDF files, you should use the asynchronous API, start_document_text_detection().

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
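The asynchronous flow suggested above might look roughly like the sketch below. It is not code from the answer: the helper name detect_pdf_text and the parameters poll_seconds and max_polls are made up for illustration, and simple polling is assumed (a production setup could instead wait for an SNS notification). The client is passed in so the function does not depend on how it was created.

```python
import time


def detect_pdf_text(client, bucket, name, poll_seconds=5, max_polls=60):
    """Run an asynchronous Textract text-detection job and return all blocks.

    `client` is expected to be a boto3 Textract client. The helper name and
    the polling parameters are illustrative assumptions.
    """
    # Start the job; the async API takes DocumentLocation, not Document.
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}}
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS, or give up.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(
            f"Textract job {job_id} ended with status {result['JobStatus']}: "
            f"{result.get('StatusMessage', 'no status message')}"
        )

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

With a real client this would be called as detect_pdf_text(boto3.client('textract', region_name='us-east-1'), 'bucketName', 'filename.pdf'), using the placeholder bucket and file names from the question.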

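If the document is only one page long, detect_document_text() works fine. A follow-up question: if the PDF has several pages, do you know how to make it process only one page? (3 upvotes)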
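The thread does not answer the page question. One possible workaround, offered only as a sketch, is to filter the returned blocks by their Page field after the job has run. This still processes the whole PDF and only narrows the output. The helper name lines_on_page is hypothetical, and treating a missing Page value as page 1 is an assumption.

```python
def lines_on_page(blocks, page=1):
    """Return the text of LINE blocks detected on the given 1-based page.

    `blocks` is a list of Textract Block dicts, for example the "Blocks"
    list from get_document_text_detection. A block without a "Page" key is
    treated as belonging to page 1 (an assumption for single-page input).
    """
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and block.get("Page", 1) == page
    ]


# Hand-written sample blocks, shaped like Textract output.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "first page line"},
    {"BlockType": "WORD", "Page": 1, "Text": "first"},
    {"BlockType": "LINE", "Page": 2, "Text": "second page line"},
]
print(lines_on_page(sample_blocks, page=2))
```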

Archived:	6 years, 3 months ago
Views:	6006
Last activity:	5 years ago