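I used to have an awk command that worked well for putting quotes around the last item in my output file (everything after the 10th comma), so that when I opened the file as a CSV, the last item would not be split up by the extra commas it contains. However, for some reason the awk command has stopped working (I never really understood it, since someone helped me create it), and it now returns a file with many blank lines or deleted data.
Here is a sample of my initial output file: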
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial
This is what I want the output file to look like: it simply has quotes around the last item, which is the full name of the species along with its sequencing information.
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
Ideally, if possible, I would also like a way to separate the species name from the rest of the name, so the output would be:
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata", "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.", "4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
This is the awk command I used before. It seemed to work well, but it now produces the output shown after it:
for file in *.txt; do awk '{ match($0,/((\S+\s+){10})(.*)/,a); gsub(/\s+/,",",a[1]); gsub(/"/,"&&",a[3]); print a[1] "\"" a[3] "\"" }' $file > ${file%%.txt}_cleanedup.txt; done
Output (the quotes are not placed around the right part; they should enclose everything after the 10th comma). In addition, the empty quotes mean that the data was deleted for some reason:
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum,amygdalinum,voucher,PCG/UNN/030-52,ribulose-1,5-bisphosphate,carboxylase/oxygenase,large,subunit,(rbcL),gene,,"partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum,amygdalinum,voucher,PCG/UNN/030-52,ribulose-1,5-bisphosphate,carboxylase/oxygenase,large,subunit,(rbcL),gene,,"partial cds; chloroplast"
""
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica,sp.,4,KS-2019,ribulose-1,5-bisphosphate,carboxylase/oxygenase,(rbcL),pseudogene,,partial,sequence;,"mitochondrial"
You can use this awk command:
awk 'match($0, /^([^,]*,){10}/) {
print substr($0, 1, RLENGTH) "\"" substr($0, RLENGTH+1) "\""
}' file
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
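The pattern `^([^,]*,){10}` matches the first ten comma-terminated fields, and `match()` sets `RLENGTH` to the length of that prefix, so everything after it is the description. Some awk builds (older mawk releases in particular, as far as I know) do not support the `{10}` interval syntax. As a rough sketch for that case, the 10th comma can also be located with `index()` in a loop, and the result applied to every `.txt` file as in the loop from the question. The function names `quote_last_field` and `clean_all` are made up for this example; the `_cleanedup.txt` suffix is taken from the question.

```shell
# Quote everything after the 10th comma without using {10} in a regex.
# Lines with fewer than 10 commas are skipped, as in the command above.
quote_last_field() {
    awk '{
        pos = 0                                  # offset of the last comma found so far
        for (n = 0; n < 10; n++) {
            i = index(substr($0, pos + 1), ",")  # next comma after pos
            if (i == 0) next                     # fewer than 10 commas: skip this line
            pos += i
        }
        print substr($0, 1, pos) "\"" substr($0, pos + 1) "\""
    }' "$1"
}

# Apply it to every .txt file in the current directory.
clean_all() {
    for file in *.txt; do
        [ -e "$file" ] || continue                       # the glob matched nothing
        case $file in *_cleanedup.txt) continue ;; esac  # skip output of earlier runs
        quote_last_field "$file" > "${file%.txt}_cleanedup.txt"
    done
}
```

Calling `clean_all` in the directory that holds the files writes one `_cleanedup.txt` file per input. Unlike the loop in the question, the file name is quoted, so names containing spaces should also work.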
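For the second part of the question (separating the species name from the rest), here is a minimal sketch. It rests on an assumption that is not guaranteed by the data: that the species name is always the first two space-separated words of the description (which happens to give "Brassica sp." for the last sample line). The function name `split_species` is made up for this example. The sketch also writes no space between the two quoted fields, because some CSV readers only recognise a quote at the very start of a field.

```shell
# Split the text after the 10th comma into "species","rest".
# ASSUMPTION: the species name is the first two space-separated words.
split_species() {
    awk '{
        pos = 0                                  # offset of the last comma found so far
        for (n = 0; n < 10; n++) {
            i = index(substr($0, pos + 1), ",")  # next comma after pos
            if (i == 0) next                     # fewer than 10 commas: skip this line
            pos += i
        }
        head = substr($0, 1, pos)                # first ten fields, trailing comma included
        desc = substr($0, pos + 1)               # free-text description
        gsub(/"/, "\"\"", desc)                  # CSV-escape embedded double quotes
        if (match(desc, /^[^ ]+ +[^ ]+ +/)) {
            species = substr(desc, 1, RLENGTH)
            sub(/ +$/, "", species)              # drop the trailing separator spaces
            rest = substr(desc, RLENGTH + 1)
            print head "\"" species "\",\"" rest "\""
        } else {
            # fewer than three words: keep it all as the species, leave the rest empty
            print head "\"" desc "\",\"\""
        }
    }' "$1"
}
```

It could be used as `split_species input.txt > output.txt`, where both file names are placeholders. Descriptions that do not start with a two-word name (a subspecies or a hybrid, for example) would be split in the wrong place, so the result is worth spot-checking.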