Posts by Dan*_*nny

How to run grep in parallel on single lines from a list

I am a beginner in bash and need some help making my work more efficient.

while IFS= read -r line
do
    echo "$line"
    file="Species.$line"
    # -A 1 also prints the sequence line that follows each matching header
    grep -A 1 -- "$line" /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta > "$file"
done < species1

The species file contains about 100,000 species names. The file I am searching is a 24 GB FASTA (text) file.

The format of the large file is:

Domain;Phylum;Class;Order;Family;Genus;Species

AGCT ---- AGCT (50,000 characters per line)

Here is a sample of the species file (there are no blank lines in between):

Alkanindiges_illinoisensis
Alkanindiges_sp._JJ005
Alligator_sinensis
Allisonella_histaminiformans
'Allium_cepa'
Alloactinosynnema_album
Alloactinosynnema_sp._Chem10
Alloactinosynnema_sp._CNBC1
Alloactinosynnema_sp._CNBC2
Alloactinosynnema_sp._FMA
Alloactinosynnema_sp._MN08-A0205
Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum

Here is the first line of the large file:

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina.............................................................................-UC-U-G--G-U---------------------------
(this goes on for 50,000 characters per line)

Here are some of the headers:

>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
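In every header quoted above, the species name is the last ";"-separated field. As a rough illustration of that layout, the sketch below tallies records per species in a single pass over the header lines. The helper name `count_species`, the temporary demo file, and the one-letter sequence lines are assumptions made for the example; the three headers are copied from the sample above.

```shell
# Sketch: tally records per species from the header lines only.
# Assumption: the species is the last ";"-separated field of each ">" header.
count_species() {
    awk -F';' '/^>/ { count[$NF]++ } END { for (s in count) print count[s], s }' "$1" |
        sort -k2
}

# Toy input built from three of the headers quoted above; the one-letter
# sequence lines are placeholders, not real data.
demo_dir=$(mktemp -d)
printf '%s\n' \
    '>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli' 'A' \
    '>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina' 'C' \
    '>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli' 'G' \
    > "$demo_dir/demo.fasta"

count_species "$demo_dir/demo.fasta"
```

A tally like this could also be used to drop names from the species list that never occur in the large file before any searching starts.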

I need one file per species, containing all of the matching sequences.

The code above works, but in 16 hours it got through fewer than 2,000 species.
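The loop starts one grep per species, so the whole 24 GB file is read again for every name in the list. A minimal sketch of a single-pass alternative follows: read the species list into memory once, then stream the FASTA file and append each wanted record to its own file. It is not the original method, and it rests on assumptions the post does not confirm: every record is a ">" header followed by exactly one sequence line, the species is the last ";"-separated field, the list is not empty, and no name contains "/". The helper name `split_by_species` and the demo files are hypothetical. Unlike the grep loop, this matches the species field exactly rather than as a substring or regular expression.

```shell
# Sketch: one pass over the FASTA, routing each record to Species.<name>.
split_by_species() {
    # $1 = species list, $2 = FASTA file, $3 = output directory
    awk -v outdir="$3" '
        # First file: remember every wanted species name.
        NR == FNR { wanted[$0] = 1; next }
        # Header line: decide where this record goes ("" means skip it).
        /^>/ {
            n = split($0, parts, ";")
            out = (parts[n] in wanted) ? (outdir "/Species." parts[n]) : ""
        }
        # Append the header and its sequence line, closing the file each
        # time so that only one output file is open at any moment.
        out != "" { print >> out; close(out) }
    ' "$1" "$2"
}

# Toy run: headers copied from the sample above, placeholder sequences.
split_demo=$(mktemp -d)
printf '%s\n' 'Brachyspira_alvinipulli' 'Lampropedia_hyalina' > "$split_demo/species1"
printf '%s\n' \
    '>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli' 'A' \
    '>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.' 'T' \
    '>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina' 'C' \
    '>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli' 'G' \
    > "$split_demo/demo.fasta"

split_by_species "$split_demo/species1" "$split_demo/demo.fasta" "$split_demo"
```

Because the output is appended with `>>`, a second run adds to existing files, so the output directory should be empty before starting. Closing the file after every write costs some speed but avoids running into the open-file limit with roughly 100,000 output files.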

I would like to run it in parallel to speed things up. Any other tips on making the search more efficient are also welcome.
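As a hedged sketch of the parallel variant, the function below runs several greps at once. It uses `xargs -P`, not GNU parallel, because `xargs` is more likely to be installed already; the helper name `parallel_grep`, the job count of 4, and the demo files are assumptions. It also adds `-F` so that each name is matched as a fixed string instead of a regular expression (in the original loop the "." in names such as `Marinobacter_sp.` matches any character). Matching is still by substring, as in the original loop.

```shell
# Sketch: run several greps at once with xargs -P (standing in for GNU parallel).
parallel_grep() {
    # $1 = species list, $2 = FASTA file, $3 = output directory
    # NUL-separate the names so that quoted entries survive xargs unchanged.
    tr '\n' '\0' < "$1" |
        xargs -0 -n 1 -P 4 sh -c '
            # $1 = FASTA file, $2 = output directory, $3 = one species name
            # grep exits with 1 when nothing matches; treat that as success.
            grep -F -A 1 -- "$3" "$1" > "$2/Species.$3" || [ "$?" -eq 1 ]
        ' _ "$2" "$3"
}

# Toy run: names and headers copied from the samples above, placeholder sequences.
par_demo=$(mktemp -d)
printf '%s\n' 'Brachyspira_pulli' 'Lampropedia_hyalina' 'Alligator_sinensis' \
    > "$par_demo/species1"
printf '%s\n' \
    '>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli' 'A' \
    '>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina' 'C' \
    '>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli' 'G' \
    > "$par_demo/demo.fasta"

parallel_grep "$par_demo/species1" "$par_demo/demo.fasta" "$par_demo"
```

With GNU parallel installed, roughly the same job should be expressible as `parallel -j 4 'grep -F -A 1 -- {} big.fasta > Species.{}' :::: species1`, where `big.fasta` is a placeholder for the large file; this form is untested here. Either way, each species still triggers a full read of the large file, so the gain is probably limited by disk throughput and not by the number of cores. A species with no match leaves an empty output file.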

Thanks

bash grep gnu-parallel
