Dai*_*ail 5 python nlp artificial-intelligence machine-learning
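I need to analyze the layout structure of different document types, such as pdf, doc, docx, odt, etc.
My task is: given a document, group the text into blocks and find the correct boundaries of each block.
I ran some tests with Apache Tika, which is a good extractor, but it often messes up the order of the blocks. Let me explain what I mean by ORDER.
Apache Tika just extracts the text, so if my document has two columns, Tika extracts the entire text of the first column and then the text of the second column, which is fine... but sometimes the text in the first column is related to the text in the second, as in a table with row relations.
So I have to take the position of each block into account, and the problems are:
Defining the block boundaries, which is hard... I need to work out whether a sentence starts a new block or not.
Defining the orientation: for example, in a table the "sentence" should be the row, not the column.
So basically I have to deal with the layout structure in order to understand the block boundaries correctly.
Here is a visual example:
A classic extractor returns: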
2019
2018
2017
2016
2015
2014
Oregon Arts Commission Individual Artist Fellowship...
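This is wrong (in my case) because the dates are related to the text on the right.
This task is preparation for further NLP analysis, so it is very important: for example, when I need to recognize entities in the text (NER) and then identify their relations, working with the right context matters a lot.
How can I extract text from a document and assemble related pieces of text under the same block (understanding the layout structure of the document)?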
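To make the intended row-wise grouping concrete, here is a minimal sketch. It assumes an extractor that keeps each visual row on one text line and separates columns with at least two spaces; the function name `rows_to_cells`, the `min_gap` parameter, and the sample rows are hypothetical and only serve as illustration.

```python
import re


def rows_to_cells(layout_text, min_gap=2):
    """Split every non-empty line into cells at runs of `min_gap` or more spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    return [
        gap.split(line.strip())
        for line in layout_text.split("\n")
        if line.strip()
    ]


# Hypothetical layout-preserving input: a date column and a description column.
sample = (
    "2019    First award description\n"
    "2018    Second award description\n"
)

for cells in rows_to_cells(sample):
    # Each row keeps the date together with the text on its right.
    print(cells)
```

With input like this, each date stays attached to the description on the same row instead of being emitted as a separate column of years.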
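This is only a partial solution to your problem, but it may simplify the task at hand.
The tool takes PDF files and converts them to text files. It runs very fast and can process large batches of files.
It creates one output text file per PDF. Its advantage over other tools is that the output text is aligned according to the original layout.
For example, this is a resume with a complex layout:

Its output is the following text file: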
    Christopher                        Summary
                                       Senior Web Developer specializing in front end development.
    Morgan                             Experienced with all stages of the development cycle for
                                       dynamic web projects. Well-versed in numerous programming
                                       languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
                                       Strong background in project management and customer
                                       relations.


                                       Skill Highlights
                                       • Project management      • Creative design
                                       • Strong decision maker   • Innovative
                                       • Complex problem         • Service-focused
                                         solver


                                       Experience
    Contact
                                       Web Developer - 09/2015 to 05/2019
    Address:                           Luna Web Design, New York
    177 Great Portland Street, London  • Cooperate with designers to create clean interfaces and
    W5W 6PQ                              simple, intuitive interactions and experiences.
                                       • Develop project concepts and maintain optimal
    Phone:                               workflow.
    +44 (0)20 7666 8555
                                       • Work with senior developer to manage large, complex
                                         design projects for corporate clients.
    Email:
                                       • Complete detailed programming and development tasks
    christoper.m@gmail.com
                                         for front end public and internal websites as well as
                                         challenging back-end server code.
    LinkedIn:
                                       • Carry out quality assurance tests to discover errors and
    linkedin.com/christopher.morgan
                                         optimize usability.

    Languages                          Education
    Spanish – C2
                                       Bachelor of Science: Computer Information Systems - 2014
    Chinese – A1
                                       Columbia University, NY
    German – A2


    Hobbies                            Certifications
                                       PHP Framework (certificate): Zend, Codeigniter, Symfony.
     • Writing
                                       Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
     • Sketching
                                       SQL, MySQL.
     • Photography
     • Design
    -----------------------Page 1 End-----------------------

Now your task is reduced to finding the blocks within a text file, using the spaces between words as alignment hints.
As a start, I include a script that finds the margin between two columns of text and yields rhs and lhs, the text streams of the right and left columns respectively.
    import re

    import matplotlib.pyplot as plt
    import numpy as np

    # txt holds the layout-preserving text of one page;
    # "resume.txt" is a placeholder file name, replace it with your own
    with open("resume.txt", encoding="utf-8") as f:
        txt = f.read()

    # keep only the lines before the end-of-page marker
    page_lines = []
    for line in txt.split("\n"):
        if "-----------------------Page" in line:  # reached end of page
            break
        page_lines.append(line)

    # pad short lines with spaces so every line has the same length
    max_line_len = max(len(line) for line in page_lines)
    padded_txt_lines = [line.ljust(max_line_len) for line in page_lines]

    # count, for each character column, how many lines have a space there
    space_idx_counters = np.zeros(max_line_len)
    for line in padded_txt_lines:
        space_idxs = [pos for pos, char in enumerate(line) if char == " "]
        space_idx_counters[space_idxs] += 1

    # plot histogram of spaces in each character column
    plt.bar(range(len(space_idx_counters)), space_idx_counters)
    plt.title("Number of spaces in each column over all lines")
    plt.show()

    # find the separator column idx (first column with the most spaces)
    separator_idx = int(np.argmax(space_idx_counters))
    print(f"separator index: {separator_idx}")

    # separate two columns of text
    left_lines = [line[:separator_idx] for line in padded_txt_lines]
    right_lines = [line[separator_idx:] for line in padded_txt_lines]

    # join each block into one stream of text, remove redundant spaces
    lhs = re.sub(r"\s{4,}", " ", " ".join(left_lines))
    rhs = re.sub(r"\s{4,}", " ", " ".join(right_lines))

    print("************ Left Hand Side ************")
    print(lhs)
    print("************ Right Hand Side ************")
    print(rhs)

Plot output:

Text output:
    separator index: 33
    ************ Left Hand Side ************
    Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: christoper.m@gmail.com LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies • Writing • Sketching • Photography • Design 
    ************ Right Hand Side ************
     Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights • Project management • Creative design • Strong decision maker • Innovative • Complex problem • Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL. 

The next step would be to generalize this script to multi-page documents, remove redundant symbols, and so on.
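One possible way to start on the multi-page generalization is sketched below. It is an assumption-laden illustration, not part of the script above: it assumes every page ends with a marker line shaped like the `Page 1 End` line in the sample output and that every page has exactly two columns. The names `split_pages` and `split_columns` are hypothetical, the space counting is done in plain Python instead of NumPy, and all whitespace runs are collapsed rather than only runs of four or more.

```python
import re

# Marker format assumed from the sample output ("-----Page 1 End-----").
PAGE_END = re.compile(r"^-+Page \d+ End-+$")


def split_pages(txt):
    """Split layout-preserving text into pages (lists of lines) at the page markers."""
    pages, current = [], []
    for line in txt.split("\n"):
        if PAGE_END.match(line.strip()):
            pages.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # trailing text without a marker
        pages.append(current)
    return pages


def split_columns(page_lines):
    """Return (lhs, rhs) text streams for one two-column page."""
    width = max((len(line) for line in page_lines), default=0)
    if width == 0:
        return "", ""
    padded = [line.ljust(width) for line in page_lines]
    # Count spaces per character column; the first column with the most
    # spaces is taken as the separator (same idea as the argmax above).
    counts = [sum(line[i] == " " for line in padded) for i in range(width)]
    sep = counts.index(max(counts))
    lhs = " ".join(line[:sep] for line in padded)
    rhs = " ".join(line[sep:] for line in padded)
    # Collapse every whitespace run into a single space.
    return re.sub(r"\s+", " ", lhs).strip(), re.sub(r"\s+", " ", rhs).strip()


# Tiny made-up two-page input, only to show the call pattern.
sample = "\n".join([
    "Name        Summary",
    "            Web developer",
    "Contact     Experience",
    "-----------------------Page 1 End-----------------------",
    "Hobbies     Education",
    " • Writing  BSc 2014",
    "-----------------------Page 2 End-----------------------",
])

for page_no, page in enumerate(split_pages(sample), start=1):
    lhs, rhs = split_columns(page)
    print(f"page {page_no} | left: {lhs} | right: {rhs}")
```

Computing the separator per page rather than once for the whole document lets the column boundary differ between pages. Single-column pages would need an extra check, since the first column with the most spaces is then not a real margin.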
Good luck!