From a text file to CSV using Python

Son*_*nny 3 python csv parsing text

I need help parsing a very long text file that looks like this:

NAME         IMP4   
DESCRIPTION  small nucleolar ribonucleoprotein 
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
.....

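I would like to get a structured CSV file with the same number of columns in every row, so that I can easily extract the information I need.

First I tried this: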

for line in a:
    if '///' not in line:
        b.write(''.join(line.replace('\n', '\t')))
    else:
        b.write('\n')

which gives a CSV like this:

NAME         IMP4\tDESCRIPTION  small nucleolar ribonucleoprotein\tCLASS        Genetic Information Processing\t             Translation\t             Ribosome biogenesis in eukaryotes\tDBLINKS      NCBI-GI: 15529982\t             NCBI-GeneID: 92856\t             OMIM: 612981
NAME         COMMD9\tDESCRIPTION  COMM domain containing 9\tORGANISM     H.sapiens\tDBLINKS      NCBI-GI: 156416007\t             NCBI-GeneID: 29099\t             OMIM: 612299

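The main problem is that fields such as DBLINKS span multiple lines in the original file, so they end up split across several fields in the result, whereas I need each of them merged into a single field. In addition, not every field is present in every record, for example the fields "CLASS" and "ORGANISM" in the sample.

The file I want to obtain should look like this: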

NAME         IMP4\tDESCRIPTION  small nucleolar ribonucleoprotein\tNA\tCLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9\tDESCRIPTION  COMM domain containing 9\tORGANISM     H.sapiens\tNA\tDBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

Could you please help me?

unu*_*tbu 5

You can use itertools.groupby twice: once to collect the lines into records, and a second time to collect the multi-line fields into iterators:

import csv
import itertools

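def is_end_of_record(line):
    # A line starting with '///' separates two records.
    return line.startswith('///')

class FieldClassifier(object):
    # Stateful key function for groupby: remembers the current field name
    # so that indented continuation lines are grouped with their field.
    def __init__(self):
        self.field=''
    def __call__(self,row):
        if not row[0].isspace():
            # An unindented line starts a new field; its first word is the name.
            self.field=row.split(' ',1)[0]
        return self.field

# Output column order; fields missing from a record are printed as 'NA'.
fields='NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data','r') as f:
    # First groupby: split the file into records at the '///' lines.
    for end_of_record, lines in itertools.groupby(f,is_end_of_record):
        if not end_of_record:
            classifier=FieldClassifier()
            record={}
            # Second groupby: gather the lines belonging to each field.
            for fieldname, row in itertools.groupby(lines,classifier):
                record[fieldname]='; '.join(r.strip() for r in row)
            print('\t'.join(record.get(fieldname,'NA') for fieldname in fields))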

which yields

NAME         IMP4   DESCRIPTION  small nucleolar ribonucleoprotein  NA  CLASS        Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS      NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME         COMMD9 DESCRIPTION  COMM domain containing 9   ORGANISM     H.sapiens  NA  DBLINKS      NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

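The above is the printed output as you would see it. It matches the desired output you posted, assuming you showed the repr of that output.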
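If you would rather write a file than print, the same grouping idea can feed the csv module. The following is only a sketch under some assumptions that are not in the question: the helper names parse_records and write_tsv are made up for illustration, the column order is the one used in this answer, and each cell holds only the value (the field names go into a header row), which differs from the desired output shown in the question.

```python
import csv
import io
import itertools

# Assumed column order (same as in the answer above).
FIELDS = ['NAME', 'DESCRIPTION', 'ORGANISM', 'CLASS', 'DBLINKS']

def is_end_of_record(line):
    return line.startswith('///')

def parse_records(lines):
    """Yield one dict per record, joining multi-line fields with '; '."""
    for end_of_record, group in itertools.groupby(lines, is_end_of_record):
        if end_of_record:
            continue
        record = {}
        field = None
        for line in group:
            if not line.strip():
                continue  # ignore blank lines
            if not line[0].isspace():
                # Unindented line: the first word is the field name.
                parts = line.split(None, 1)
                field = parts[0]
                value = parts[1] if len(parts) > 1 else ''
            else:
                # Indented line: continuation of the current field.
                value = line
            if field is not None:
                record.setdefault(field, []).append(value.strip())
        if record:
            yield {name: '; '.join(values) for name, values in record.items()}

def write_tsv(lines, out, fields=FIELDS, missing='NA'):
    """Write a header row plus one tab-separated row per record."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(fields)
    for record in parse_records(lines):
        writer.writerow([record.get(name, missing) for name in fields])

# Small in-memory demo using the two records from the question.
sample = """\
NAME         IMP4
DESCRIPTION  small nucleolar ribonucleoprotein
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
"""
buffer = io.StringIO()
write_tsv(io.StringIO(sample), buffer)
print(buffer.getvalue())
```

For a real file you would pass the open input file as lines and an output file opened with newline='' as out. Because every row is built from the same FIELDS list, every row has the same number of columns, with 'NA' where a record lacks a field.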



  • Thanks, but you don't have to. Everything I post here is free. If you want to cite a source, please cite this page. (3 upvotes)