在第 10 个逗号后的项目周围加上引号

Kat*_*hau 3 awk

我曾经有一个 awk 命令,它可以很好地在输出文件中的最后一项(第 10 个逗号之后)周围加上引号,这样当我将它们作为 CSV 文件打开时,最后一项就不会因为它的原因而被分割。额外的逗号。然而,由于某种原因, awk 命令被破坏了(我从来没有想到有人帮助我创建它)并且它返回一个包含许多空行或已删除数据的文件。

这是我的初始输出文件的示例:

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial
Run Code Online (Sandbox Code Playgroud)

这就是我想要的输出文件的样子 - 它只是在最后一项周围有引号,即物种的全名及其测序信息。

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
Run Code Online (Sandbox Code Playgroud)

理想情况下,如果可能的话,我也想要一种将物种名称与名称的其余部分分开的方法,因此输出将是:

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"   
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum", "voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"        
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata", ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"     
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp.", "4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
Run Code Online (Sandbox Code Playgroud)

我之前使用过的 awk 命令似乎运行良好,但现在我的输出如下:

for file in *.txt; do awk '{ match($0,/((\S+\s+){10})(.*)/,a); gsub(/\s+/,",",a[1]); gsub(/"/,"&&",a[3]); print a[1] "\"" a[3] "\"" }' $file > ${file%%.txt}_cleanedup.txt; done
Run Code Online (Sandbox Code Playgroud)

输出(它没有在正确的部分加上引号,应该是第 10 个逗号之后的所有内容)。此外,空引号仅意味着数据由于某种原因被删除:

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum,amygdalinum,voucher,PCG/UNN/030-52,ribulose-1,5-bisphosphate,carboxylase/oxygenase,large,subunit,(rbcL),gene,,"partial cds; chloroplast"

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,Gymnanthemum,amygdalinum,voucher,PCG/UNN/030-52,ribulose-1,5-bisphosphate,carboxylase/oxygenase,large,subunit,(rbcL),gene,,"partial cds; chloroplast"

""

023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,Brassica,sp.,4,KS-2019,ribulose-1,5-bisphosphate,carboxylase/oxygenase,(rbcL),pseudogene,,partial,sequence;,"mitochondrial"
Run Code Online (Sandbox Code Playgroud)

anu*_*ava 5

你可以使用这个awk

awk 'match($0, /^([^,]*,){10}/) {
   print substr($0, 1, RLENGTH) "\"" substr($0, RLENGTH+1) "\""
}' file

e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
e7479580f6f3be15b5632f64f9de8df7,gi|1858620278|gb|MN628024.1|,132,541,100,132,100.000,2.02e-60,244,82755,"Gymnanthemum amygdalinum voucher PCG/UNN/030-52 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, partial cds; chloroplast"
b875a20e3a4876aba15b0edf8973a3f4,gi|1832942633|gb|MN431198.1|,132,573,100,132,100.000,2.02e-60,244,39414,"Plantago lanceolata ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcl) gene, partial cds; chloroplast"
023abf2ebf1c94fe890dfd1517a828c5,gi|1562068410|gb|MH569150.1|,132,715,98,129,100.000,9.41e-59,239,2508311,"Brassica sp. 4 KS-2019 ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) pseudogene, partial sequence; mitochondrial"
Run Code Online (Sandbox Code Playgroud)