Son*_*nny 3 python csv parsing text
I need help parsing a very long text file that looks like this:
NAME        IMP4
DESCRIPTION small nucleolar ribonucleoprotein
CLASS       Genetic Information Processing
            Translation
            Ribosome biogenesis in eukaryotes
DBLINKS     NCBI-GI: 15529982
            NCBI-GeneID: 92856
            OMIM: 612981
///
NAME        COMMD9
DESCRIPTION COMM domain containing 9
ORGANISM    H.sapiens
DBLINKS     NCBI-GI: 156416007
            NCBI-GeneID: 29099
            OMIM: 612299
///
.....
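I would like to get a structured csv file with the same number of columns in every row, so that I can easily extract the information I need.
First I tried this: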
for line in a:
    if '///' not in line:
        b.write(''.join(line.replace('\n', '\t')))
    else:
        b.write('\n')
which gives a csv like this:
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tCLASS Genetic Information Processing\t Translation\t Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982\t NCBI-GeneID: 92856\t OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tDBLINKS NCBI-GI: 156416007\t NCBI-GeneID: 29099\t OMIM: 612299
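The main problem is with fields such as DBLINKS: they span several lines in the original file, so in the result they end up split across several columns, whereas I need each of them merged into a single column. In addition, not every field is present in every record, for example the fields "CLASS" and "ORGANISM" in the sample above.
The file I would like to obtain should look like this: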
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tNA\tCLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tNA\tDBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
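Could you please help me?
You can use itertools.groupby twice: once to collect the lines into records, and a second time to collect the lines of each (possibly multi-line) field into an iterator: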
import itertools

def is_end_of_record(line):
    # Records are terminated by a line starting with '///'.
    return line.startswith('///')

class FieldClassifier(object):
    # Stateful key for groupby: a line that starts with whitespace is a
    # continuation line and inherits the name of the most recent field.
    def __init__(self):
        self.field=''
    def __call__(self,row):
        if not row[0].isspace():
            self.field=row.split(' ',1)[0]
        return self.field

fields='NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data','r') as f:
    for end_of_record, lines in itertools.groupby(f,is_end_of_record):
        if not end_of_record:
            classifier=FieldClassifier()
            record={}
            # Consecutive lines sharing a field name are joined with '; '.
            for fieldname, row in itertools.groupby(lines,classifier):
                record[fieldname]='; '.join(r.strip() for r in row)
            # Fields missing from this record are written as 'NA'.
            print('\t'.join(record.get(fieldname,'NA') for fieldname in fields))
yields
NAME IMP4 DESCRIPTION small nucleolar ribonucleoprotein NA CLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9 DESCRIPTION COMM domain containing 9 ORGANISM H.sapiens NA DBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
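The above is the printed output as you would see it. It matches the desired output you posted, assuming what you showed was the repr of that output.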
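If you want an actual file rather than printed lines, the same two-level groupby idea can be wrapped in a generator and handed to the csv module. The following is only a sketch under a few assumptions: the names parse_records, write_tsv and FIELDS are made up for this example, continuation lines are assumed to start with whitespace, and, unlike the output above, the field name is stripped from each cell and written once in a header row.

```python
import csv
import itertools

# Column order taken from the answer above; adjust to the fields in your file.
FIELDS = ['NAME', 'DESCRIPTION', 'ORGANISM', 'CLASS', 'DBLINKS']


def is_end_of_record(line):
    return line.startswith('///')


class FieldClassifier:
    """Stateful groupby key: continuation lines inherit the last field name."""

    def __init__(self):
        self.field = ''

    def __call__(self, line):
        # Assumption: only continuation (or blank) lines start with whitespace.
        if line and not line[0].isspace():
            self.field = line.split(None, 1)[0]
        return self.field


def parse_records(lines):
    """Yield one {field name: joined value} dict per '///'-terminated record."""
    for end_of_record, group in itertools.groupby(lines, is_end_of_record):
        if end_of_record:
            continue
        record = {}
        for name, rows in itertools.groupby(group, FieldClassifier()):
            parts = [r.strip() for r in rows]
            text = '; '.join(p for p in parts if p)
            # Drop the leading field name so the cell holds only the value.
            record[name] = text[len(name):].strip()
        yield record


def write_tsv(records, out, fields=FIELDS, missing='NA'):
    """Write records as tab-separated rows, filling absent fields with `missing`."""
    writer = csv.DictWriter(out, fieldnames=fields, delimiter='\t',
                            restval=missing, extrasaction='ignore',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)
```

With an input file opened as f and an output file opened as out with newline='', write_tsv(parse_records(f), out) would write one row per record. Because restval supplies the 'NA' placeholder, every row has the same number of columns even when a record lacks CLASS or ORGANISM.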
References for the tools used: