bash: extract a substring by pattern, handling empty fields and multiple occurrences

Pol*_*ova 0 regex bash awk extract

I want to extract the Pfam_A information from each line of a file:

item_1    ID=HJNANFJJ_180142;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_180142;partial=01;product=unannotated protein;KEGG=K03531
item_4    ID=HJNANFJJ_87662;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_87662;partial=10;product=unannotated protein;KEGG=K15725;Pfam_A=OEP;Resfams=adeC-adeK-oprM
item_8    ID=HJNANFJJ_328505;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_328505;partial=11;product=unannotated protein;KEGG=K03578;Pfam_A=OB_NTP_bind    
item_2    ID=HJNANFJJ_512995;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_512995;partial=11;product=unannotated protein;KEGG=K00674;Pfam_A=Hexapep;Pfam_A=Hexapep_2;metacyc=TETHYDPICSUCC-RXN
item_0    ID=HJNANFJJ_188729;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_188729;partial=11;product=unannotated protein

In some lines this information is missing entirely, and in others it may occur more than once.

In the end, I want a table like this, with NaN in place of empty fields and multiple occurrences separated by tabs into different fields:

item_1    NaN
item_4    OEP
item_8    OB_NTP_bind    
item_2    Hexapep    Hexapep_2
item_0    NaN

anu*_*ava 5

You can use this awk command:

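awk -v OFS='\t' 'NF > 1 {              # skip blank or single-field lines
   s = ""                              # Pfam_A values collected for this line
   n = split($NF, a, /;/)              # $NF is the last whitespace-separated field; split it on ";"
   for (i=1; i<=n; i++)                # keep every key=value pair whose key is Pfam_A
      if (split(a[i], b, /=/) == 2 && b[1] == "Pfam_A")
         s = s OFS b[2]
   print $1 (s ? s : OFS "NaN")        # print NaN when nothing matched
}' file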

Output:

item_1  NaN
item_4  OEP
item_8  OB_NTP_bind
item_2  Hexapep Hexapep_2
item_0  NaN
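
Note that the command above only inspects `$NF`, the text after the last whitespace on the line. That is enough for the sample data, where every `Pfam_A=` entry follows `product=unannotated protein`. If a `Pfam_A=` entry could appear before a value that contains a space, a variant that scans the whole attribute string may be safer. The following is only a sketch under that assumption; the function name `extract_pfam` is hypothetical and not part of the original answer:

```shell
# Hypothetical helper: reads annotation lines on stdin and writes
# "item<TAB>value[<TAB>value...]" or "item<TAB>NaN" on stdout.
extract_pfam() {
  awk -v OFS='\t' 'NF > 1 {
     id = $1                                    # item name, e.g. item_4
     rest = $0
     sub(/^[ \t]*[^ \t]+[ \t]+/, "", rest)      # drop the item name, keep the attributes
     s = ""
     n = split(rest, a, /;/)                    # split the whole attribute string on ";"
     for (i = 1; i <= n; i++)
        if (a[i] ~ /^Pfam_A=/) {
           v = substr(a[i], 8)                  # text after the 7-character "Pfam_A="
           sub(/[ \t]+$/, "", v)                # trim trailing whitespace
           s = s OFS v
        }
     print id (s ? s : OFS "NaN")               # NaN when no Pfam_A entry was found
  }'
}
```

It can be used as `extract_pfam < file`. On the sample input from the question it should give the same table as the command above.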