I have a file containing the following information:
gene 3025..3855
/gene="Sp34_10000100"
/ID="Sp34_10000100"
CDS join(3025..3106,3722..3855)
/gene="Sp34_10000100"
/codon_start=1
/ID="Sp34_10000100.t1.cds1,Sp34_10000100.t1.cds2"
mRNA 3025..3855
/ID="Sp34_10000100.t1"
/gene="Sp34_10000100"
gene 12640..13470
/gene="Sp34_10000200"
/ID="Sp34_10000200"
CDS join(12640..12721,13337..13470)
/gene="Sp34_10000200"
/codon_start=1
/ID="Sp34_10000200.t1.cds1,Sp34_10000200.t1.cds2"
mRNA 12640..13470
/ID="Sp34_10000200.t1"
/gene="Sp34_10000200"
gene 15959..20678
/gene="Sp34_10000300"
/ID="Sp34_10000300"
CDS join(15959..16080,16268..16367,18913..19116,20469..20524,20582..20678)
/gene="Sp34_10000300"
/codon_start=1
/ID="Sp34_10000300.t1.cds1,Sp34_10000300.t1.cds2,Sp34_10000300.t1.cds3,Sp34_10000300.t1.cds4,Sp34_10000300.t1.cds5"
mRNA 15959..20678
/ID="Sp34_10000300.t1"
/gene="Sp34_10000300"
gene 22255..23085
/gene="Sp34_10000400"
/ID="Sp34_10000400"
I want to remove all the gene sections, but keep the CDS and mRNA information. The output should look like this:
CDS join(3025..3106,3722..3855)
/gene="Sp34_10000100"
/codon_start=1
/ID="Sp34_10000100.t1.cds1,Sp34_10000100.t1.cds2"
mRNA 3025..3855
/ID="Sp34_10000100.t1"
/gene="Sp34_10000100"
CDS join(12640..12721,13337..13470)
/gene="Sp34_10000200"
/codon_start=1
/ID="Sp34_10000200.t1.cds1,Sp34_10000200.t1.cds2"
mRNA 12640..13470
/ID="Sp34_10000200.t1"
/gene="Sp34_10000200"
CDS join(15959..16080,16268..16367,18913..19116,20469..20524,20582..20678)
/gene="Sp34_10000300"
/codon_start=1
/ID="Sp34_10000300.t1.cds1,Sp34_10000300.t1.cds2,Sp34_10000300.t1.cds3,Sp34_10000300.t1.cds4,Sp34_10000300.t1.cds5"
mRNA 15959..20678
/ID="Sp34_10000300.t1"
/gene="Sp34_10000300"
Any suggestions on how to do this would be appreciated.
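awk is usually easy to read and understand:
This is a simple program that prints by default. When it sees a line whose first word is "gene", it sets "weprint" to "0" (= off, we will not print), and it turns printing back on when it sees a line whose first word is "CDS" or "mRNA":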
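awk '
BEGIN { weprint = 1 }                          # print by default
$1 == "gene" { weprint = 0 }                   # a gene section starts: stop printing
$1 == "CDS" || $1 == "mRNA" { weprint = 1 }    # a CDS or mRNA section starts: resume printing
weprint == 1 { print $0 }                      # print the line only while the flag is on
' file_to_read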
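The BEGIN block runs once, before any line is read.
Every other ( test ) { action if test successful } rule is evaluated, in order, against each line of input (unless an action contains next, in which case awk ignores the remaining rules for that line and fetches the next line of input).
The program prints only the "CDS" and "mRNA" sections, and not the "gene" sections.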
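To illustrate what next changes, here is a minimal sketch of a variant in which it actually matters. The sample data, the skip flag, and the next_output variable are made up for the illustration and are not part of the question's file. The section header is printed explicitly in the second rule, so next is needed there to keep the third rule from printing the same line a second time.

```shell
# Small sample in the same layout as the question (coordinates are made up).
sample='gene 1..10
/ID="g1"
CDS join(1..4,7..10)
/ID="g1.t1.cds1,g1.t1.cds2"
mRNA 1..10
/ID="g1.t1"
gene 20..30
/ID="g2"'

next_output=$(printf '%s\n' "$sample" | awk '
$1 == "gene" { skip = 1; next }                        # gene header: start skipping, done with this line
$1 == "CDS" || $1 == "mRNA" { skip = 0; print; next }  # section header: print it once, then move on
!skip { print }                                        # remaining lines of a kept section
')
printf '%s\n' "$next_output"
```

Without the next in the second rule, every CDS and mRNA header line would appear twice in the output.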
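The weprint program could be "golfed" (for example, the default action for a successful test is to print $0, so the last line could be just ( weprint == 1 ) with no action), but in my opinion that is less clear to grasp...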
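For comparison, the shortened form could look roughly like the sketch below. The flag name p, the golfed_output variable, and the trimmed sample are assumptions chosen for the illustration. A pattern with no action prints the line, so the bare p at the end prints whenever the flag is non-zero.

```shell
# A trimmed sample taken from the first records of the question.
sample='gene 3025..3855
/gene="Sp34_10000100"
/ID="Sp34_10000100"
CDS join(3025..3106,3722..3855)
/gene="Sp34_10000100"
mRNA 3025..3855
/ID="Sp34_10000100.t1"
gene 12640..13470
/gene="Sp34_10000200"'

# p is the print flag: cleared on "gene", set on "CDS" or "mRNA".
golfed_output=$(printf '%s\n' "$sample" |
  awk 'BEGIN { p = 1 } $1 == "gene" { p = 0 } $1 == "CDS" || $1 == "mRNA" { p = 1 } p')
printf '%s\n' "$golfed_output"
```

It behaves the same as the longer version; the trade-off is only readability.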