Extract only the body text from arXiv articles formatted as .tex

Question

bri*_*akh 6 python latex nlp extract tex

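My dataset consists of arXiv astrophysics articles as .tex files, and I need to extract text only from the article body, not from any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).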

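I've been trying with Python3 and tex2py, but I'm struggling to get a clean corpus, because the files differ in their labeling and the text is broken up between labels.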

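I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, text breaks at some labels, and some tables and figures are included.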

Code:

import os
from tex2py import tex2py

corpus = open('corpus2.tex', 'a')

def parseFiles():
    """
    Parses downloaded document .tex files for word content.
    We are only interested in the article body, defined by /section tags.
    """

    for file in os.listdir("latex"):
        if file.endswith('.tex'):
            print('\nChecking ' + file + '...')
            with open("latex/" + file) as f:
                try:
                    toc = tex2py(f) # toc = tree of contents
                    # If file is a document, defined as having \begin{document}
                    if toc.source.document:
                        # Iterate over each section in document
                        for section in toc:
                            # Parse the section
                            getText(section)
                    else:
                        print(file + ' is not a document. Discarded.')
                except (EOFError, TypeError, UnicodeDecodeError): 
                    print('Error: ' + file + ' was not correctly formatted. Discarded.')



def getText(section):
    """
    Extracts text from given "section" node and any nested "subsection" nodes. 

    Parameters
    ----------
    section : list
        A "section" node in a .tex document 
    """

    # For each element within the section 
    for x in section:
        if hasattr(x.source, 'name'):
            # If it is a subsection or subsubsection, parse it
            if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                getText(x)
            # Avoid parsing past these sections
            elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                return
        # If element is text, add it to corpus
        elif isinstance(x.source, str):
            # If element is inline math, worry about it later
            if x.source.startswith('$') and x.source.endswith('$'):
                continue
            corpus.write(str(x))
        # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
        elif type(x.source).__name__ == 'RArg':
            corpus.write(str(x.source))


if __name__ == '__main__':
    """Runs if script called on command line"""
    parseFiles()


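Other links:

I'm aware of a related question (Programmatically converting/parsing LaTeX code to plain text), but there does not seem to be a conclusive answer.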

Answer 1

Alv*_*Wan 2

To grab all the text from a document, tree.descendants is a lot friendlier here. It outputs all the text in order.

def getText(section):
    for token in section.descendants:
        if isinstance(token, str):
            corpus.write(str(token))

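To illustrate why walking descendants keeps the text in order, here is a minimal sketch on a toy tree. The Node class below is a hypothetical stand-in written for this example; it is not a tex2py or TexSoup class, and real parsers may expose text tokens differently. The point is only that a pre-order walk visits nested subsection text at the position where it occurs, which is what a manual recursion can get wrong.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed LaTeX node (not a TexSoup class)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)

    @property
    def descendants(self) -> Iterator[Union["Node", str]]:
        """Yield every nested node and text string in document (pre-order) order."""
        for child in self.children:
            yield child
            if isinstance(child, Node):
                yield from child.descendants


def collect_text(root: Node) -> List[str]:
    """Return the text strings found under root, in the order they appear."""
    return [item for item in root.descendants if isinstance(item, str)]


# A toy document: one section with a nested subsection, then a second section.
doc = Node("document", [
    Node("section", [
        "Intro text. ",
        Node("subsection", ["Nested text. "]),
        "More intro. ",
    ]),
    Node("section", ["Second section. "]),
])

print("".join(collect_text(doc)))
```

The nested subsection text comes out between the two pieces of its parent section, exactly where it sits in the source.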

To catch edge cases, I wrote a slightly more fleshed-out version. It includes checks for all the conditions you listed there.

from TexSoup import RArg

def getText(section):
    for x in section.descendants:
        if isinstance(x, str):
            if x.startswith('$') and x.endswith('$'):
                continue
            corpus.write(str(x))
        elif isinstance(x, RArg):
            corpus.write(str(x))
        elif hasattr(x, 'source') and hasattr(x.source, 'name') and x.source.name in ('acknowledgements', 'appendix'):
            return

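The version above still writes out text that sits inside tables, figures, or the abstract. One possible complement, sketched here with only the standard library re module, is to strip those parts from the raw source before handing it to the parser. This is an assumption-laden sketch, not part of tex2py or TexSoup: the function name strip_non_body, the list of environments, and the end-of-body markers are all made up for illustration and will need adjusting to the corpus. It does not handle nested braces inside footnotes or nested environments of the same name.

```python
import re

# Assumed list of environments to discard; adjust for your corpus.
SKIP_ENVIRONMENTS = (
    "abstract", "table", "table*", "figure", "figure*",
    "equation", "equation*", "thebibliography",
)

# Assumed markers for where the article body ends.
END_OF_BODY = re.compile(
    r"\\appendix\b|\\section\*?\{\s*acknowledge?ments?\s*\}",
    re.IGNORECASE,
)


def strip_non_body(tex: str) -> str:
    """Return the LaTeX source with non-body parts removed (rough heuristic)."""
    # Drop comments: an unescaped % up to the end of the line.
    tex = re.sub(r"(?<!\\)%.*", "", tex)

    # Keep only what lies between \begin{document} and \end{document}.
    match = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", tex, re.DOTALL)
    if match:
        tex = match.group(1)

    # Cut everything from the first end-of-body marker onwards.
    end = END_OF_BODY.search(tex)
    if end:
        tex = tex[:end.start()]

    # Remove each unwanted environment together with its contents.
    for env in SKIP_ENVIRONMENTS:
        name = re.escape(env)
        pattern = r"\\begin\{" + name + r"\}.*?\\end\{" + name + r"\}"
        tex = re.sub(pattern, "", tex, flags=re.DOTALL)

    # Remove footnotes that contain no nested braces.
    tex = re.sub(r"\\footnote\{[^{}]*\}", "", tex)
    return tex


sample = r"""\documentclass{article}
\title{A Title}
\begin{document}
\begin{abstract}Short summary.\end{abstract}
\section{Introduction}
Stars form in clouds.\footnote{See also Fig. 1.}
\begin{table}
a & b \\
\end{table}
\section{Results}
We find 10\% more stars. % a comment
\section*{Acknowledgements}
We thank the referee.
\end{document}
"""

body = strip_non_body(sample)
print(body)
```

The cleaned string can then be written to a temporary file or passed on to whichever parser is in use, so the tree walk only ever sees section headings and body paragraphs.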

Archived:	8 years, 2 months ago
Views:	826
Last activity:	7 years, 7 months ago