ite*_*r07 5 python ocr google-api-python-client google-cloud-platform cloud-document-ai
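I am using the Document OCR API to extract text from a PDF file, but part of the result is inaccurate. I suspect the cause is the presence of some Chinese characters.
Below is a made-up example: I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.
When I use the website version, I do not get the Chinese characters, but the remaining characters are correct.
When I use Python to extract the text, I get the Chinese characters correctly, but some of the remaining characters are wrong.
The actual string I got.
Are the versions of Document AI used by the website and the API different? How can I get all the characters correctly?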
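Update:
When I print detected_languages after printing the text, I get the output below. (I don't know why, but with lines = page.lines, detected_languages is an empty list for both lines, so I first had to change it to page.blocks or page.paragraphs.)
Code: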
from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

# You must set the api_endpoint if you use a location other than 'us'
opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: documentai.Document.Page.Layout, document: documentai.Document) -> str:
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # An unset start_index defaults to 0 in the response message
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()
    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}
    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages
    response_text = []

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        # page.lines returned empty detected_languages, so blocks are used instead
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            # Drop a single trailing newline before storing the text
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
The detected language codes look wrong. Could that affect the result?
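To see at a glance which language codes the processor attached to the blocks, a small helper along these lines might be useful. It is only a sketch: `summarize_languages` is a hypothetical name, and the stand-in objects below use made-up values that mimic the `language_code` and `confidence` fields of each `detected_languages` entry. With a real response you would pass `page.blocks` instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_languages(elements):
    """Sum the reported confidence per language code across layout elements."""
    totals = defaultdict(float)
    for element in elements:
        for lang in element.detected_languages:
            totals[lang.language_code] += lang.confidence
    # Highest total confidence first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Stand-in objects with made-up values (not real API output)
fake_blocks = [
    SimpleNamespace(detected_languages=[
        SimpleNamespace(language_code="en", confidence=0.5),
        SimpleNamespace(language_code="zh", confidence=0.25),
    ]),
    SimpleNamespace(detected_languages=[
        SimpleNamespace(language_code="en", confidence=0.25),
    ]),
]
print(summarize_languages(fake_blocks))
```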
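Posting this as Community Wiki for better visibility.
One of the features of DocumentAI is OCR (Optical Character Recognition), which recognizes text in various kinds of files.
In this scenario, OP received different outputs from the Try it function and from the Python client library.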
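Why is there a discrepancy between Try it and the Python library? It is hard to say, since both methods use the same API, documentai_v1beta3. It might be related to some modification of the file when the PDF is uploaded to the Try it demo, a different endpoint, language alphabet recognition, or something random.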
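To pin down exactly where the two outputs diverge, one option is to diff the text copied from Try it against the text returned by the Python client. The sketch below uses the standard library's `difflib`; `show_differences` is a hypothetical helper, and the two sample strings are made up rather than taken from OP's document.

```python
import difflib


def show_differences(web_text, api_text):
    """Return (operation, web_fragment, api_fragment) for every mismatch."""
    matcher = difflib.SequenceMatcher(None, web_text, api_text, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, web_text[i1:i2], api_text[j1:j2]))
    return diffs


# Made-up strings standing in for the website and API outputs
web_text = "Total: 123"
api_text = "Total: 漢字 128"
for operation, web_part, api_part in show_differences(web_text, api_text):
    print(operation, repr(web_part), repr(api_part))
```

This does not explain the discrepancy, but it shows which characters each path handles differently, which helps when reporting the issue.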
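When you use the Python client, you also get the confidence of the text recognition as a percentage. Here are examples from my tests:
However, OP's recognition confidence is 0.73, so wrong results are possible, and in this case the problem is visible. I don't think it can be improved through code anyway. A PDF of different quality might help (in OP's example there are some dots that could affect recognition).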
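Although code cannot raise the confidence itself, the (text, confidence) tuples returned by OP's get_lines_of_text can at least be split so that low-confidence blocks are flagged for manual review. This is a minimal sketch: `split_by_confidence` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption to tune per document set, and the sample data is made up.

```python
def split_by_confidence(blocks, threshold=0.9):
    """Separate (text, confidence) tuples into trusted and needs-review lists."""
    trusted, needs_review = [], []
    for text, confidence in blocks:
        if confidence >= threshold:
            trusted.append((text, confidence))
        else:
            needs_review.append((text, confidence))
    return trusted, needs_review


# Made-up sample in the same shape as the list built in get_lines_of_text
sample = [("Invoice 2021", 0.98), ("T0tal: 1O0", 0.73)]
trusted, needs_review = split_by_confidence(sample)
print("Trusted:", trusted)
print("Needs review:", needs_review)
```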