Parsing text from multiple txt files in Python

sam*_*n13 15 python parsing dictionary nlp

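I'm looking for advice on how to mine items from multiple text files to build a dictionary.

This text file: https://pastebin.com/Npcp3HCM

was manually converted into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view

There are thousands of such text files, and they may have different section headings, as in these examples: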

  1. https://pastebin.com/wWSPGaLX
  2. https://pastebin.com/9Up4RWHu

I started by reading the files:

from glob import glob

txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)

with open(txtFiles[0],'r') as tf:
    allLines = [line.rstrip() for line in tf]

sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']

for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum,allLines[lineNum])

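My idea was to find the line numbers where the section headings occur, extract the content between those line numbers, and then strip out separators such as the runs of dashes. This isn't working. I'm trying to build a dictionary of the following kind so that I can later run various natural language processing algorithms on the mined items.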
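For illustration, the line-number idea could be sketched roughly as below. This is only a minimal sketch, assuming each heading sits alone on its own line and matches the list exactly; the function name `split_by_headings` and the sample lines are hypothetical and not taken from the real files.

```python
def split_by_headings(lines, headings):
    """Map each known heading to the content lines that follow it."""
    # Indices of the lines that exactly match one of the known headings.
    positions = [i for i, line in enumerate(lines) if line.strip() in headings]
    sections = {}
    # Pair each heading index with the next one (or the end of the file).
    for start, end in zip(positions, positions[1:] + [len(lines)]):
        body = [
            ln for ln in lines[start + 1:end]
            # Drop blank lines and separator lines made only of '=' or '-'.
            if ln.strip() and set(ln.strip()) not in ({"="}, {"-"})
        ]
        sections[lines[start].strip()] = body
    return sections


# Hypothetical sample, shaped like the delimiters described in the question.
sample = [
    "=====",
    "Presentation",
    "=====",
    "Operator [1]",
    "-----",
    "Welcome to the call.",
    "=====",
    "Questions and Answers",
    "=====",
    "Analyst [2]",
    "-----",
    "Thank you.",
]
sections = split_by_headings(sample, {"Presentation", "Questions and Answers"})
```

Exact string matching is fragile here: a misspelled or unexpected heading is silently skipped and its content is folded into the previous section.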

{file-name1:{
    {date-time:[string]},
    {corporate-name:[string]},
    {corporate-participants:[name1,name2,name3]},
    {call-participants:[name4,name5]},
    {section-headings:{
        {heading1:[
            {name1:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name3:[speechOrderNum, text-content]}],
        {heading2:[
            {name1:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name3:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name1:[speechOrderNum, text-content]},
            {name4:[speechOrderNum, text-content]}],
        {heading3:[text-content]},
        {heading4:[text-content]}
        }
    }
}

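The challenge is that different files may have different headings and a different number of headings. But there will always be a section called "Presentation", and there will very likely be a "Questions and Answers" section. These section headings are always delimited by a string of equals signs. The content of different speakers is always separated by a string of dashes. The "speech order" in the Q&A section is indicated by a number in square brackets. Participants are always listed at the beginning of the document with an asterisk before their name, and their title is always on the next line.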
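As an illustration of the participant layout just described (asterisk before the name, title on the next line), a regular expression along these lines might work. It is a sketch under that stated assumption only; `PARTICIPANT`, `parse_participants` and the sample names are made up for the example.

```python
import re

# Assumed layout: "  * Name" on one line, the title on the following line.
PARTICIPANT = re.compile(
    r"^[ \t]*\*[ \t]*(?P<name>.+?)[ \t]*\n[ \t]*(?P<title>.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_participants(block):
    """Return {name: title} for every asterisk entry in the block."""
    return {m.group("name"): m.group("title") for m in PARTICIPANT.finditer(block)}


# Hypothetical sample block; the names and titles are invented.
sample_block = """
   *  Jane Doe
      Example Corp - CEO
   *  John Roe
      Example Corp - CFO
"""
participants = parse_participants(sample_block)
```

Using `[ \t]*` rather than `\s*` keeps each capture on a single line, so a name can never swallow the title that follows it.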

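Any suggestions on how to parse these text files are appreciated. Ideally, the help would give guidance on how to produce such a dictionary (or another suitable data structure) for each file, which can then be written to a database.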
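To make the "one dictionary per file" goal concrete, a loop over the files might look like the sketch below. The `parse` callable is a placeholder for whatever per-file parser is eventually used, `build_corpus` is a hypothetical name, and the temporary files exist only so the example runs on its own.

```python
import json
import os
import tempfile
from glob import glob


def build_corpus(pattern, parse):
    """Apply parse() to every file matching pattern, keyed by file name."""
    corpus = {}
    for path in sorted(glob(pattern)):
        with open(path, "r", encoding="utf-8") as fh:
            corpus[os.path.basename(path)] = parse(fh.read())
    return corpus


# Demo with two throwaway files and a trivial stand-in parser.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "Presentation\nhello"), ("b.txt", "Presentation\nworld")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(body)
    corpus = build_corpus(
        os.path.join(tmp, "*.txt"),
        lambda text: {"lines": text.splitlines()},
    )

# A JSON string is one simple intermediate form before loading into a database.
as_json = json.dumps(corpus, indent=2)
```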

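Thanks

- Edit -

One of the files looks like this: https://pastebin.com/MSvmHb2e

in which the "Questions and Answers" section is mislabeled as "Presentation" and there is no other "Questions and Answers" section.

A final sample text: https://pastebin.com/jr9WfpV8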

ent*_*phy 8

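The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.

In short, I use a regular expression to find the '=' delimiter lines and subdivide the whole text into sections, then handle each type of section separately for clarity (so you can tell me how you would like each case handled).

Side note: I use the words 'attendee' and 'author' interchangeably.

EDIT: Updated the code to sort based on the '[x]' pattern found next to the attendee/author in the presentation/Q&A sections. Also changed the pretty-printing part, since pprint does not handle OrderedDict well.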

To strip any extra whitespace, including \n, from both ends of a string, just call str.strip(). If you only need to strip \n, use str.strip('\n').
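A quick illustration of the difference, using made-up strings. Note that `strip()` only touches the two ends of a string; the last line shows one common way to collapse whitespace inside it as well.

```python
text = "\n  Thank you, operator.  \n"

# Removes every kind of leading and trailing whitespace.
both_ends = text.strip()

# Removes only leading and trailing newlines; the spaces stay.
newlines_only = text.strip("\n")

# Interior whitespace is untouched by strip(); splitting on whitespace and
# re-joining with single spaces collapses it.
collapsed = " ".join("Good   morning,\n everyone.".split())
```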

I have modified the code to strip any whitespace from the talks.

import json
import re
from collections import OrderedDict


# Subdivides a block of text using the delimiting regular expression.
# >>> example_string = ("=============================\n"
# ...                   "asdfasdfasdf\n"
# ...                   "sdfasdfdfsdfsdf\n"
# ...                   "=============================\n"
# ...                   "asdfsdfasdfasd\n"
# ...                   "=============================\n")
# >>> subdivide(example_string, "^=+")
# ['', 'asdfasdfasdf\nsdfasdfdfsdfsdf', 'asdfsdfasdfasd', '']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    # strip the newlines left on either side of each delimiter line.
    sections = [section.strip('\n') for section in sections]
    return sections


# for processing sections with dashes in them, returns the heading of the section along with
# a dictionary where each key is the subsection's header, and each value is the text in the subsection.
def process_dashed_sections(section):

    subsections = subdivide(section, "^-+")
    heading = subsections[0]  # header of the section.
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(\d+)\]")

    # sort key: capture the '[x]' pattern in the subsection header and return 'x' as an int.
    # This is passed as the key function to 'sorted' to sort based on 'x'.
    def sort_key(item):
        mat = index_pattern.findall(item[0])
        if mat:
            return int(mat[0])
        # Subsections delimited by '-'s but lacking the '[x]' pattern sort first.
        return 0

    o_d = OrderedDict(sorted(d.items(), key=sort_key))
    return heading, o_d


# this is to rename the keys of 'd' dictionary to the proper names present in the attendees.
# it searches for the best match for the key in the 'attendees' list, and replaces the corresponding key.
# >>> d = {'mr. man   ceo of company   [1]' : ' This is talk a' ,
#  ...     'ms. woman  ceo of company    [2]' : ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> new_d = assign_attendee(d, l)
# new_d = {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        matches = [a for a in attendees if a in key]
        if len(matches) == 1:
            # strip leading and trailing whitespace (including '\n') from the text.
            # note: a later subsection by the same attendee overwrites an earlier one.
            new_d[matches[0]] = value.strip()
        else:
            # no match, or an ambiguous one: keep the original key so nothing is dropped.
            new_d[key] = value.strip()
    return new_d


if __name__ == '__main__':
    with open('input.txt', 'r') as infile:
        lines = infile.read()

    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)

    # regex pattern for matching sections with subsections in them.
    dash_pattern = re.compile(r"^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step1. Divide the entire document into sections using the '=' divider
    sections = subdivide(lines, "^=+")
    header = ''

    # Step2. Handle each section like a switch case
    for section in sections:

        # Skip empty sections (e.g. the text before a leading '=' line).
        if not section.strip():
            continue

        # Handle headers
        if len(section.split('\n')) == 1:  # a single line is assumed to be a header
            header = section

        # Handle attendees/authors
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)

            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header; store only this section's attendees under it.
            if header:
                talks_d[header] = d

        # Handle subsections
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d[heading] = d

        # Else it's just some other text.
        else:
            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header
            if header:
                talks_d[header] = section

    # To assign the talks material to the appropriate attendee/author. Still works if no match found.
    for key, value in talks_d.items():
        if isinstance(value, dict):  # plain-text sections have no speakers to assign
            talks_d[key] = assign_attendee(value, ppl_d.keys())

    # ordered dict does not pretty print using 'pprint'. So a small hack to make use of json output to pretty print.
    print(json.dumps(talks_d, indent=4))
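One limitation of keying the result by speaker is that a speaker who talks more than once keeps only their last talk. If every turn needs to be preserved, a list of turns is one alternative. The sketch below is not part of the answer's code: `parse_turns`, `SPEAKER` and the sample text are hypothetical, and it assumes speaker headers end with the order number in square brackets.

```python
import re

# Assumed speaker-header shape: "Name, Company - Title   [3]".
SPEAKER = re.compile(r"^(?P<who>.*?)\s*\[(?P<order>\d+)\]\s*$")


def parse_turns(section):
    """Return the section heading and a list of turns sorted by speech order."""
    # Split on lines that consist only of dashes.
    parts = [p.strip() for p in re.split(r"^-+[ \t]*$", section, flags=re.MULTILINE)]
    turns = []
    # After the heading, the parts alternate: speaker header, then spoken text.
    for header, text in zip(parts[1::2], parts[2::2]):
        m = SPEAKER.match(header)
        turns.append({
            "order": int(m.group("order")) if m else 0,
            "speaker": m.group("who") if m else header,
            # Collapse line breaks and runs of spaces inside the text.
            "text": " ".join(text.split()),
        })
    turns.sort(key=lambda t: t["order"])
    return parts[0], turns


# Hypothetical sample in the dashed layout described in the question.
sample = """Presentation
--------
Operator   [1]
--------
 Good morning.

--------
Jane Doe, Example Corp - CEO   [2]
--------
 Thanks.
 Let's begin.

--------
Operator   [3]
--------
 Next question.
"""
heading, turns = parse_turns(sample)
```

Here the operator's two turns both survive, each with its own order number, and the speaker string can still be matched against the attendee list afterwards.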