
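Google Document AI gives different outputs for the same file

I am using the Document OCR API to extract text from a PDF file, but part of the result is inaccurate. I suspect the cause is the presence of some Chinese characters.

Below is a made-up example: I cropped the region where the extracted text was wrong and added some Chinese characters to reproduce the problem.

Input file

When I use the website version, I do not get the Chinese characters, but the remaining characters are correct.

Website version OCR result

When I extract the text with Python, I get the Chinese characters correctly, but some of the remaining characters are wrong.

Python result

The actual string I get:

Actual result

Are the Document AI versions used by the website and the API different? How can I get all the characters correctly?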

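Update:

When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines, detected_languages is an empty list for both lines; I had to switch to page.blocks or page.paragraphs first.)

Language codes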

Code:

from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: dict, document: dict):
    """
    Document AI …


python ocr google-api-python-client google-cloud-platform cloud-document-ai

ite*_*r07

2021-08-18

Score: 5

Answers: 1

Views: 989