Tag: amazon-textract

Unsupported document format when using Amazon Textract

I am using Amazon Textract with boto3. When I try to parse a PDF file accessed through Amazon S3, it gives me an error: Request has unsupported document format. I am new to this, and the Textract documentation says that PDF files are supported.

This is the code I am using.

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']


This gives me the error: Request has unsupported document format.
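For reference, a hedged sketch of the asynchronous route: Textract's asynchronous operations accept PDF documents stored in S3, including multi-page ones. `start_document_text_detection` and `get_document_text_detection` are real boto3 calls; the helper names `wait_for_job`, `extract_lines` and `detect_pdf_text`, the polling interval, and the bucket/key values are assumptions for illustration. Whether this removes the error above depends on the file itself, which the question does not show.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5.0, max_polls=60):
    """Poll GetDocumentTextDetection until the job leaves IN_PROGRESS.

    Returns the list of all Blocks, following NextToken pagination.
    """
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(poll_seconds)
            continue
        if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        blocks = list(response.get("Blocks", []))
        # Large results are paginated; keep fetching while a NextToken is returned.
        while response.get("NextToken"):
            response = client.get_document_text_detection(
                JobId=job_id, NextToken=response["NextToken"]
            )
            blocks.extend(response.get("Blocks", []))
        return blocks
    raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")


def extract_lines(blocks):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [b["Text"] for b in blocks if b["BlockType"] == "LINE"]


def detect_pdf_text(client, bucket, key, poll_seconds=5.0):
    """Start an asynchronous text-detection job for an S3 object and collect its lines."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(wait_for_job(client, job["JobId"], poll_seconds=poll_seconds))
```

With a real client this would be called as `detect_pdf_text(boto3.client('textract', region_name='us-east-1'), 'bucketName', 'filename.pdf')`. Passing the client in as a parameter keeps the helpers easy to exercise against a stub.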

python python-3.x amazon-textract

Jun*_*apa

lucky-day

10
Votes

1
Answers

6006
Views

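AWS Textract StartDocumentAnalysis function does not publish a message to the SNS topic

I am using AWS Textract and I want to analyze a multi-page document, so I have to use the asynchronous option. I first called the startDocumentAnalysis function and got a JobId back, but it needs to trigger a function that I have set up to fire when the SNS topic receives a message.

These are my serverless file and handler file.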

provider:
  name: aws
  runtime: nodejs8.10
  stage: dev
  region: us-east-1
  iamRoleStatements:
    - Effect: "Allow"
      Action:
       - "s3:*"
      Resource: { "Fn::Join": ["", ["arn:aws:s3:::${self:custom.secrets.IMAGE_BUCKET_NAME}", "/*" ] ] }
    - Effect: "Allow"
      Action:
        - "sts:AssumeRole"
        - "SNS:Publish"
        - "lambda:InvokeFunction"
        - "textract:DetectDocumentText"
        - "textract:AnalyzeDocument"
        - "textract:StartDocumentAnalysis"
        - "textract:GetDocumentAnalysis"
      Resource: "*"

custom:
  secrets: ${file(secrets.${opt:stage, self:provider.stage}.yml)}

functions:
  routes:
    handler: src/functions/routes/handler.run
    events:
      - s3:
          bucket: ${self:custom.secrets.IMAGE_BUCKET_NAME}
          event: s3:ObjectCreated:*

  textract:
    handler: src/functions/routes/handler.detectTextAnalysis
    events:
      - sns: "TextractTopic"

resources:
  Resources: …

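For comparison, a minimal Python sketch (boto3 rather than the Node.js SDK used in the question) of how the SNS topic is attached to the job. `NotificationChannel` is an optional parameter of `StartDocumentAnalysis` and takes both the topic ARN and the ARN of a role that Textract can assume to publish to it. The function name `build_analysis_request` and every ARN value are placeholders, and the handler from the question is not shown, so this may or may not be what is missing there.

```python
def build_analysis_request(bucket, key, topic_arn, role_arn, feature_types=("TABLES", "FORMS")):
    """Build keyword arguments for textract.start_document_analysis.

    topic_arn and role_arn are caller-supplied placeholders here.
    """
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(feature_types),
        # Without this block Textract has no topic to notify when the job ends.
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }
```

The result can be passed straight through, for example `client.start_document_analysis(**build_analysis_request(bucket, key, topic_arn, role_arn))`.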

amazon-web-services aws-sdk aws-lambda aws-sdk-nodejs amazon-textract

gok*_*ack

2019 07-04

7
Votes

3
Answers

1868
Views

AWS Textract - UnsupportedDocumentException - PDF

I am using boto3 (the AWS SDK for Python) to analyze a document (PDF) to get form key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')


I followed the AWS documentation for analyzing documents, and when I run my function I get this error.

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format


Am I missing something?

python amazon-web-services boto3 amazon-textract

gmw*_*934

2021 06-19

6
Votes

1
Answers

7066
Views

InvalidS3ObjectException: Unable to get object metadata from S3?

So I am trying to use Amazon Textract to read multiple PDF files, each with multiple pages, using the StartDocumentTextDetection method as follows:

import random

import boto3

# s3 was not defined in the excerpt; a boto3 S3 resource is assumed
s3 = boto3.resource('s3')
client = boto3.client('textract')
textract_bucket = s3.Bucket('my_textract_console-us-east-2')

for s3_file in textract_bucket.objects.all():
    print(s3_file)

    response = client.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": "my_textract_console_us-east-2",
                "Name": s3_file.key,
            }
        },
        # randint needs integer bounds; 10**10 replaces the float 1e10
        ClientRequestToken=str(random.randint(1, 10**10)))
    print(response)
    break


When I just try to retrieve the response object from s3, I can see it printed as:

s3.ObjectSummary(bucket_name='my_textract_console-us-east-2', key='C:\\Users\\My_User\\Documents\\Folder\\Sub_Folder\\Sub_sub_folder\\filename.PDF')


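Accordingly, I use s3_file.key to access the object later. But I get the following error, which I cannot figure out:

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentTextDetection operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

So far I have:

Checked the region from the boto3 session; the bucket and the AWS config settings are all set to us-east-2.
The key cannot be wrong, since I pass it directly from the object response.
As for permissions, I checked the IAM console and it is set to AmazonS3FullAccess and AmazonTextractFullAccess.

What could be going wrong here?

[EDIT] I did rename the files so that they no longer contain \\, but it still does not seem to work, which is odd.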
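
One detail visible in the snippet: the bucket is listed as `my_textract_console-us-east-2` (hyphen before `us`), while the request hard-codes `my_textract_console_us-east-2` (underscore). The excerpt cannot confirm that this is the cause, but a sketch that takes both the bucket and the key from the listed object avoids that class of mismatch. `bucket_name` and `key` are real attributes of a boto3 `ObjectSummary`; the helper names `document_location` and `start_jobs` are assumptions.

```python
def document_location(s3_object):
    """Build a Textract DocumentLocation from an object exposing bucket_name and key."""
    return {"S3Object": {"Bucket": s3_object.bucket_name, "Name": s3_object.key}}


def start_jobs(client, objects):
    """Start one text-detection job per listed object and return the JobIds."""
    return [
        client.start_document_text_detection(
            DocumentLocation=document_location(obj)
        )["JobId"]
        for obj in objects
    ]
```

In the loop from the question this would be `start_jobs(client, textract_bucket.objects.all())`.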

python amazon-s3 amazon-web-services boto3 amazon-textract

oce*_*800

2021 06-19

6
Votes

1
Answers

6137
Views

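Using Textract, how can I extract tables from a PDF file and output them to a CSV file via a .py script?

I want to use Textract (via the AWS CLI) to extract tables from a PDF file (located in S3) and export them to a CSV file. I tried writing a .py script but I am struggling to read from the file. Any suggestions on writing the .py script are welcome.

Here is my current script. I am getting this error: File "extract-table.py", line 63, in get_table_csv_results bash: File: command not found blocks=response['Blocks'] KeyError: 'Blocks'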

import webbrowser, os
import json
import boto3
import io
from io import BytesIO
import sys
from pprint import pprint





def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create …

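The script above is cut off, so what follows is not the asker's code. It is a self-contained sketch of the same idea under stated assumptions: it takes a list of Textract blocks that already contains TABLE, CELL and WORD entries, maps each table to rows and columns, and renders CSV text. The function `tables_to_csv` and the helper `get_text` are hypothetical names. The `KeyError: 'Blocks'` in the question means the response dictionary had no `Blocks` key at all, which this sketch does not address; it starts from the block list.

```python
import csv
import io


def get_text(cell, blocks_map):
    """Join the WORD children of a cell into one string."""
    words = []
    for relationship in cell.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_map[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def get_rows_columns_map(table_block, blocks_map):
    """Map row index -> column index -> cell text for one TABLE block."""
    rows = {}
    for relationship in table_block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            cell = blocks_map[child_id]
            if cell["BlockType"] == "CELL":
                row = rows.setdefault(cell["RowIndex"], {})
                row[cell["ColumnIndex"]] = get_text(cell, blocks_map)
    return rows


def tables_to_csv(blocks):
    """Render every TABLE block in a list of Textract blocks as CSV text."""
    blocks_map = {block["Id"]: block for block in blocks}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        rows = get_rows_columns_map(block, blocks_map)
        for row_index in sorted(rows):
            row = rows[row_index]
            writer.writerow([row[col] for col in sorted(row)])
        writer.writerow([])  # blank line between tables
    return out.getvalue()
```

Using the `csv` module rather than manual string joins keeps cells that contain commas or quotes correctly escaped.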

python text-extraction amazon-web-services amazon-textract

Chr*_*You

2021 06-19

6
Votes

1
Answers

3566
Views

How to use Amazon Textract with PDF files

I can already use Textract, but only with JPEG files. I would like to use it with PDF files.

I have the following code:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# removing the quotation marks from …


ocr text-extraction amazon-web-services amazon-textract

Art*_*urS

2019 11-26

5
Votes

2
Answers

7329
Views

AWS Textract InvalidParameterException

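I have a .NET Core client application that uses Amazon Textract with S3, SNS and SQS, following the AWS documentation on detecting and analyzing text in multi-page documents (https://docs.aws.amazon.com/textract/latest/dg/async.html).

I created an AWS role with the AmazonTextractServiceRole policy and added the following trust relationship as per the documentation (https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html): { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "textract.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }

I subscribed the SQS queue to the SNS topic and gave the Amazon SNS topic permission to send messages to the Amazon SQS queue, as per the AWS documentation.

All resources, including the S3 bucket, SNS and SQS, are in the same us-west-2 region.

The following method fails with a generic "InvalidParameterException": Request has invalid parameters.

But if the NotificationChannel section is commented out, the code works fine and returns the correct job ID.

The error message does not give a clear picture of which parameter is at fault. Any help is highly appreciated.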

public async Task<string> ScanDocument()
{
            string roleArn = "aws:iam::xxxxxxxxxxxx:instance-profile/MyTextractRole";
            string topicArn = "aws:sns:us-west-2:xxxxxxxxxxxx:AmazonTextract-My-Topic";
            string bucketName = "mybucket";
            string filename = "mytestdoc.pdf";

            var …

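One thing visible in the excerpt: both strings start with `aws:` rather than `arn:aws:`, and the role string names an `instance-profile` rather than a `role`. The truncated method hides the rest of the call, so this may not be the only issue. Below is a small Python sketch (not the .NET code from the question) of a sanity check on ARN shape, based on the general `arn:partition:service:region:account-id:resource` layout. The helper names and the sample account ID in the usage note are hypothetical.

```python
def parse_arn(arn):
    """Split an ARN into its six colon-separated parts, or raise ValueError."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(
            f"not an ARN (expected 'arn:partition:service:region:account:resource'): {arn!r}"
        )
    keys = ("prefix", "partition", "service", "region", "account", "resource")
    return dict(zip(keys, parts))


def is_iam_role_arn(arn):
    """True if the string looks like an IAM role ARN (arn:...:iam::<account>:role/<name>)."""
    try:
        parsed = parse_arn(arn)
    except ValueError:
        return False
    return parsed["service"] == "iam" and parsed["resource"].startswith("role/")


def is_sns_topic_arn(arn):
    """True if the string looks like an SNS topic ARN with a region and a topic name."""
    try:
        parsed = parse_arn(arn)
    except ValueError:
        return False
    return parsed["service"] == "sns" and bool(parsed["region"]) and bool(parsed["resource"])
```

For example, `is_iam_role_arn("arn:aws:iam::123456789012:role/MyTextractRole")` passes the check, while the two strings from the question do not.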

amazon-web-services .net-core amazon-textract

Nab*_*eel

lucky-day

5
Votes

2
Answers

5913
Views

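AWS Textract - Is there a way to tell which words are bold?

I am using AWS Textract's document text detection, but it does not seem to detect whether text is bold. Am I missing something, or is this simply not a feature?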

amazon-web-services amazon-textract

DIR*_*AVE

2021 06-19

5
Votes

1
Answers

1324
Views

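How to retrieve tables present in a PDF using AWS Textract in Java

I found the article below, which does it in Python.

https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I also used the article below to extract text.

https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html

But the above article only helps with getting text. I also used the Block function "block.getBlockType()", but none of the blocks returned its type as "CELL", even though there are tables in the image/PDF.

Help me find a Java library similar to "boto3" to extract all the tables.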
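
As a point of reference, `DetectDocumentText` returns only PAGE, LINE and WORD blocks; TABLE and CELL blocks are produced by `AnalyzeDocument` (or `StartDocumentAnalysis`) when `TABLES` is listed in `FeatureTypes`. A quick way to see which case applies is to count the block types in a response. The sketch is in Python to match the other snippets on this page, and `count_block_types` and `has_tables` are hypothetical helper names; the same check can be done on the blocks returned by the Java SDK.

```python
from collections import Counter


def count_block_types(blocks):
    """Count Textract blocks by their BlockType value."""
    return dict(Counter(block["BlockType"] for block in blocks))


def has_tables(blocks):
    """True if the response contains at least one TABLE and one CELL block."""
    counts = count_block_types(blocks)
    return counts.get("TABLE", 0) > 0 and counts.get("CELL", 0) > 0
```

If `has_tables` is false for a document that clearly contains a table, the operation that produced the blocks is the first thing to check.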

java amazon-web-services spring-boot amazon-textract

Far*_*han

2021 06-19

5
Votes

1
Answers

1158
Views

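Using Textract locally for OCR

I want to extract text from an image using Python. (The Tesseract lib does not work for me because it needs to be installed.)

I found the boto3 lib and Textract, but I am having trouble using it. I am still new to this. Can you tell me what I need to do to run my script correctly?

Here is my code: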

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img})  # gives me error
print(response)


When I run this code, I get:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for …


python amazon-web-services amazon-textract

tag*_*aga

2021 06-19

5
Votes

1
Answers

6503
Views

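Amazon Textract: I can't find the trp module

I want to use this Amazon table text extraction script

The problem I have is that I don't know what the trp module is or how to install it.

I tried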

pip install trp


But when I try to run it, I get this error

lib/python3.7/site-packages/trp/__init__.py", line 31
    print ip
           ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(ip)?


python amazon-web-services python-3.x amazon-textract

Iak*_*ias

lucky-day

4
Votes

3
Answers

2425
Views

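How to get the font style from an image with text?

I am using the Amazon Textract API through AWS's Python API to extract text from documents (pdf or jpg). I do get the text and the coordinates of its bounding box, but I would also like to have the font type (only the main ones are needed: Arial, Helvetica, Verdana, Calibri, Times New Roman + a few others).

Does anyone have a solution for getting that data?

The best solution would probably be a package that takes a small image, returns the font type name, and that I can run on my own server. An external API would most likely be too costly (in money and time), since I have to run it more than 100 times per second.

What Amazon Textract returns (unfortunately, no font type):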

{'BlockType': 'LINE',
 'Confidence': 99.81985473632812,
 'Text': 'This is a text',
 'Geometry': {'BoundingBox': {'Width': 0.7395017743110657,
   'Height': 0.012546566314995289,
   'Left': 0.12995509803295135,
   'Top': 0.2536422610282898},
  'Polygon': [{'X': 0.12995509803295135, 'Y': 0.2536422610282898},
   {'X': 0.8694568872451782, 'Y': 0.2536422610282898},
   {'X': 0.8694568872451782, 'Y': 0.2661888301372528},
   {'X': 0.12995509803295135, 'Y': 0.2661888301372528}]},
 'Id': '59f42615-7f33-41d2-9f3c-77ae5e4b6e7a',
 'Relationships': ...}

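Since Textract's `BoundingBox` values are ratios of the page width and height, cutting out the small per-line image for a font classifier needs them converted to pixels first. Here is a minimal sketch of that conversion only; it does not identify fonts. `bbox_to_pixels` is a hypothetical helper, and the page dimensions are whatever the rendered page image measures.

```python
def bbox_to_pixels(bounding_box, page_width, page_height):
    """Convert a Textract BoundingBox (ratios) to pixel (left, top, right, bottom)."""
    left = round(bounding_box["Left"] * page_width)
    top = round(bounding_box["Top"] * page_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * page_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * page_height)
    return left, top, right, bottom
```

The tuple order (left, top, right, bottom) matches the box that Pillow's `Image.crop` takes, if Pillow is what you use for cropping.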