bri*_*akh 6 python latex nlp extract tex
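My dataset consists of arXiv astrophysics articles as .tex files, and I need to extract only the text of the article body, not any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).
I have been trying with Python3 and tex2py, but I am struggling to get a clean corpus, because the files differ in how they are tagged and the text is broken up between tags.
I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, the text breaks at some tags, and some tables and figures are included.
Code: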
import os
from tex2py import tex2py

corpus = open('corpus2.tex', 'a')

def parseFiles():
    r"""
    Parses downloaded document .tex files for word content.
    We are only interested in the article body, defined by \section tags.
    """
    for file in os.listdir("latex"):
        if file.endswith('.tex'):
            print('\nChecking ' + file + '...')
            with open("latex/" + file) as f:
                try:
                    toc = tex2py(f)  # toc = tree of contents
                    # If file is a document, defined as having \begin{document}
                    if toc.source.document:
                        # Iterate over each section in document
                        for section in toc:
                            # Parse the section
                            getText(section)
                    else:
                        print(file + ' is not a document. Discarded.')
                except (EOFError, TypeError, UnicodeDecodeError):
                    print('Error: ' + file + ' was not correctly formatted. Discarded.')

def getText(section):
    """
    Extracts text from given "section" node and any nested "subsection" nodes.

    Parameters
    ----------
    section : list
        A "section" node in a .tex document
    """
    # For each element within the section
    for x in section:
        if hasattr(x.source, 'name'):
            # If it is a subsection or subsubsection, parse it
            if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                getText(x)
            # Avoid parsing past these sections
            elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                return
        # If element is text, add it to corpus
        elif isinstance(x.source, str):
            # If element is inline math, worry about it later
            if x.source.startswith('$') and x.source.endswith('$'):
                continue
            corpus.write(str(x))
        # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
        # (compare with ==, since `is` tests identity rather than equality)
        elif type(x.source).__name__ == 'RArg':
            corpus.write(str(x.source))

if __name__ == '__main__':
    # Runs if script called on command line
    parseFiles()
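Remaining links:
I am aware of a related question (Programmatically converting/parsing LaTeX code to plain text), but there does not seem to be a conclusive answer there.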
To grab all the text from a document, tree.descendants is a lot friendlier here. It outputs all the text in order.
def getText(section):
    # Walk every nested node, not just the direct children
    for token in section.descendants:
        if isinstance(token, str):
            corpus.write(str(token))
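To see why walking descendants keeps the text in document order while iterating over direct children does not, here is a minimal toy sketch. The `Node` class, `direct_text` and `all_text` are hypothetical stand-ins written only for this illustration; they are not tex2py or TexSoup classes, and the real `descendants` attribute may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed LaTeX node (not a tex2py/TexSoup class)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)

    @property
    def descendants(self) -> Iterator[Union["Node", str]]:
        """Yield every nested node and string depth-first, in document order."""
        for child in self.children:
            yield child
            if isinstance(child, Node):
                yield from child.descendants


def direct_text(node: Node) -> List[str]:
    """Collect only the strings that are immediate children of the node."""
    return [c for c in node.children if isinstance(c, str)]


def all_text(node: Node) -> List[str]:
    """Collect every string at any depth, preserving document order."""
    return [d for d in node.descendants if isinstance(d, str)]


# A toy section: text, then a subsection with nested emphasis, then more text.
section = Node("section", [
    "Intro text. ",
    Node("subsection", ["Nested text. ", Node("emph", ["italic"])]),
    "Closing text.",
])

print(direct_text(section))  # nested text is missing
print(all_text(section))     # nested text appears where it occurs in the source
```

In the toy tree, the child-only walk skips everything inside the subsection, which is roughly the symptom described in the question.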
To catch the edge cases, I wrote a slightly more fleshed-out version. It includes checks for all of the conditions you listed there.
from TexSoup import RArg

def getText(section):
    for x in section.descendants:
        if isinstance(x, str):
            # Skip inline math
            if x.startswith('$') and x.endswith('$'):
                continue
            corpus.write(str(x))
        elif isinstance(x, RArg):
            corpus.write(str(x))
        # Stop once the acknowledgements or appendix is reached
        elif hasattr(x, 'source') and hasattr(x.source, 'name') and x.source.name in ('acknowledgements', 'appendix'):
            return
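The question also mentions tables and figures leaking into the corpus. One possible complement, which is not part of the answer above, is to strip those environments from the raw .tex string before parsing it. The sketch below uses only the standard `re` module. The function names `strip_environments` and `body_only`, the list of environments, and the list of cut markers are all assumptions made for illustration; real arXiv sources vary, and nested environments of the same name are not handled.

```python
import re

# Environments to drop entirely; this list is an assumption, extend it as needed.
SKIP_ENVIRONMENTS = ("table", "table*", "figure", "figure*", "abstract")

# Markers after which nothing belongs to the article body (also an assumption).
CUT = re.compile(
    r"\\(?:appendix|acknowledge?ments?|begin\{acknowledge?ments?\}"
    r"|section\*?\{Acknowledge?ments?\}|begin\{thebibliography\}"
    r"|bibliography\{|end\{document\})",
    re.IGNORECASE,
)


def strip_environments(tex: str, names=SKIP_ENVIRONMENTS) -> str:
    r"""Remove each \begin{name}...\end{name} block (non-nested) from the source."""
    for name in names:
        escaped = re.escape(name)
        pattern = r"\\begin\{%s\}.*?\\end\{%s\}" % (escaped, escaped)
        tex = re.sub(pattern, "", tex, flags=re.DOTALL)
    return tex


def body_only(tex: str) -> str:
    r"""Keep what lies between \begin{document} and the first cut marker."""
    marker = r"\begin{document}"
    start = tex.find(marker)
    if start != -1:
        tex = tex[start + len(marker):]
    match = CUT.search(tex)
    return tex[:match.start()] if match else tex


sample = r"""\documentclass{article}
\title{A Title}
\begin{document}
\begin{abstract}Short summary.\end{abstract}
\section{Introduction}
Body text here.
\begin{table}
\begin{tabular}{cc} a & b \end{tabular}
\end{table}
More body text.
\acknowledgments
Thanks to everyone.
\end{document}
"""

cleaned = body_only(strip_environments(sample))
print(cleaned)
```

The cleaned string still contains LaTeX commands such as \section, so it would then go through the parser and getText as before; this step only narrows what the parser sees.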