This is an example of the complex tab-delimited file I am trying to parse:
ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
I am trying to get output that contains only some of the fields, for example:
ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY …

I need help parsing a very long text file that looks like this:
NAME IMP4
DESCRIPTION small nucleolar ribonucleoprotein
CLASS Genetic Information Processing
Translation
Ribosome biogenesis in eukaryotes
DBLINKS NCBI-GI: 15529982
NCBI-GeneID: 92856
OMIM: 612981
///
NAME COMMD9
DESCRIPTION COMM domain containing 9
ORGANISM H.sapiens
DBLINKS NCBI-GI: 156416007
NCBI-GeneID: 29099
OMIM: 612299
///
.....
I want to obtain a structured CSV file with the same number of columns in every row, so that I can easily extract the information I need.
First I tried this:
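# a is the input file object, b is the output file object (both opened elsewhere)
for line in a:
    if '///' not in line:
        # join the lines of one record with tabs
        b.write(''.join(line.replace('\n', '\t')))
    else:
        # '///' ends a record, so start a new output row
        b.write('\n')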
This produces a CSV like this:
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tCLASS Genetic Information Processing\t Translation\t Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982\t NCBI-GeneID: 92856\t
OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM …
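
The attempt above writes one tab per input line, so a record with three CLASS lines gets more columns than a record with none. One possible way to get the same number of columns in every row is to collect each record into a dictionary keyed by its keyword and then write a fixed set of columns. The following is only a sketch under assumptions: the `FIELDS` list is taken from the keywords visible in the sample and may need extending for the full file, a line whose first word is not in `FIELDS` is treated as a continuation of the previous field, and the helper names `parse_records` and `write_csv`, the `"; "` separator and the `NA` placeholder are illustrative choices.

```python
import csv
import io

# Assumed column set, taken from the keywords visible in the sample;
# extend it if the real file contains other keywords.
FIELDS = ["NAME", "DESCRIPTION", "ORGANISM", "CLASS", "DBLINKS"]


def parse_records(lines):
    """Yield one dict per record; a record ends at a '///' line."""
    record, current = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "///":
            if record:
                yield record
            record, current = {}, None
            continue
        key, _, rest = line.partition(" ")
        if key in FIELDS:
            # A known keyword starts a new field.
            current = key
            record[current] = [rest.strip()]
        elif current is not None:
            # Anything else continues the most recent field.
            record[current].append(line)
    if record:
        yield record


def write_csv(records, out, missing="NA"):
    """Write a header plus one row per record, always len(FIELDS) columns."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for rec in records:
        writer.writerow(
            ["; ".join(rec[f]) if f in rec else missing for f in FIELDS]
        )


sample = """NAME IMP4
DESCRIPTION small nucleolar ribonucleoprotein
CLASS Genetic Information Processing
Translation
Ribosome biogenesis in eukaryotes
DBLINKS NCBI-GI: 15529982
NCBI-GeneID: 92856
OMIM: 612981
///
NAME COMMD9
DESCRIPTION COMM domain containing 9
ORGANISM H.sapiens
DBLINKS NCBI-GI: 156416007
NCBI-GeneID: 29099
OMIM: 612299
///
"""

buf = io.StringIO()
write_csv(parse_records(sample.splitlines()), buf)
print(buf.getvalue())
```

For a real file, the same two functions can be given open file objects instead of the in-memory sample; opening the output with `newline=""` is what the `csv` module documentation recommends. Missing fields (ORGANISM for IMP4, CLASS for COMMD9) are filled with the placeholder, so every row has the same width.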