Google Document AI gives different outputs for the same file

Question

ite*_*r07 5 python ocr google-api-python-client google-cloud-platform cloud-document-ai

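I am using the Document OCR API to extract text from a PDF file, but part of the extracted text is inaccurate. I suspect the cause is the presence of some Chinese characters.

Below is a made-up example: I cropped the region where the extracted text was wrong and added some Chinese characters to reproduce the problem.

Input file

When I use the website version, I do not get the Chinese characters, but the remaining characters are correct.

OCR result from the website version

When I extract the text with Python, I get the Chinese characters correctly, but some of the remaining characters are wrong.

Program result

The actual string I get.

Actual result

Do the website and the API use different versions of Document AI? How can I get all of the characters correctly?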

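Update:

When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines, detected_languages is an empty list for both lines, so I first had to switch to page.blocks or page.paragraphs.)

Language codes

Code: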

from google.cloud import documentai_v1beta3 as documentai

project_id= 'secret-medium-xxxxxx'
location = 'us' # Format is 'us' or 'eu'
processor_id = 'abcdefg123456' #  Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # Proto integer fields default to 0, so a missing start_index is read as 0.
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document

    document_pages = document.pages

    response_text = []
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))

It looks like the language code is wrong. Does that affect the result?
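To get an overview of the language codes reported across all blocks, a small helper along these lines might help. This is only a sketch: summarize_languages is a hypothetical name, and the stand-in objects below merely mimic the detected_languages, language_code and confidence attributes that page.blocks entries expose, with made-up values.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_languages(elements):
    """Return {language_code: highest confidence seen} across the elements."""
    best = defaultdict(float)
    for element in elements:
        for lang in element.detected_languages:
            best[lang.language_code] = max(best[lang.language_code], lang.confidence)
    return dict(best)


# Stand-in objects shaped like page.blocks entries; the values are made up.
fake_blocks = [
    SimpleNamespace(detected_languages=[
        SimpleNamespace(language_code="en", confidence=0.9),
    ]),
    SimpleNamespace(detected_languages=[
        SimpleNamespace(language_code="zh", confidence=0.6),
        SimpleNamespace(language_code="en", confidence=0.4),
    ]),
    SimpleNamespace(detected_languages=[]),  # e.g. what page.lines returned
]
print(summarize_languages(fake_blocks))
```

With a real response, the same function could be called as summarize_languages(page.blocks) for each page.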

Answer 1

Pjo*_*erS 1

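Posting this as Community Wiki for better visibility.

One of the features of Document AI is OCR (Optical Character Recognition), which makes it possible to recognize text in various kinds of files.

In this case, the OP received different outputs from the Try it feature and from the Python client library.

Why is there a difference between Try it and the Python library? It's hard to say, because both methods use the same API, documentai_v1beta3. It may be related to some modification of the file when the PDF is uploaded to the Try it demo, a different endpoint, language or alphabet recognition, or something random.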
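To pin down exactly where the two outputs disagree, one option is to diff the two strings character by character with the standard library. This is a minimal sketch, not something from the Document AI API: diff_ocr_outputs is a hypothetical helper, and the two sample strings are invented rather than taken from the OP's results.

```python
import difflib


def diff_ocr_outputs(web_text, api_text):
    """Return (tag, web_fragment, api_fragment) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, web_text, api_text, autojunk=False)
    return [
        (tag, web_text[i1:i2], api_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up strings, not the OP's real output.
web_version = "Total: 1O0 units"
api_version = "Total: 100 units 中文"
changes = diff_ocr_outputs(web_version, api_version)
for change in changes:
    print(change)
```

Each tuple names the kind of change (replace, insert or delete) and the fragment on each side, which makes it easier to see whether the mismatches cluster around the Chinese characters.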

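When you use the Python client, you also get the confidence of the text recognition as a percentage. Below is an example from my tests:

However, the OP's recognition confidence of 0.73 can produce wrong results, and in this case that is a visible problem. I don't think it can be improved from code anyway. A PDF of different quality might help (in the example the OP showed, there are some dots that may affect recognition).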
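Even if the recognition itself cannot be improved from code, the confidence values can at least be used to flag blocks for manual review. The sketch below assumes a list of (text, confidence) pairs like the one get_lines_of_text returns in the question. The flag_low_confidence name, the 0.8 threshold and the sample data are all assumptions for illustration.

```python
def flag_low_confidence(blocks, threshold=0.8):
    """Return the (text, confidence) pairs whose confidence is below threshold."""
    return [(text, conf) for text, conf in blocks if conf < threshold]


# Invented sample shaped like the response_text list from the question.
sample = [("Invoice 2021", 0.98), ("T0tal 中文", 0.73), ("Signed", 0.91)]
for text, conf in flag_low_confidence(sample):
    print(f"review manually: {text!r} (confidence {conf:.2f})")
```

The threshold is arbitrary and would need tuning against documents whose correct text is known.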

Archived:	4 years, 6 months ago
Views:	989
Last activity:	4 years, 6 months ago