jac*_*ack 2 awk bioinformatics
list.txt:
58759__len__2903
58759__len__2903
673957__len__1655
673957__len__1655
3566454__len__1744
seq.fasta:
>58759__len__2903
TTTTCCGTAGAGGAGATCCCTATTTTTAGGTTTGTAAGAGATCATTTT
>67777__len__2978
TTTTTAGGTTTGTAAGACCGTAGAG
>673957__len__1655
CCCTATTTTTAGGTTTGTAAGGTTTGTAAGACCGTAGAG
>3566454__len__1744
GGTTTGTAAGACCGTAGAGGGTTTGTAAGACCGTAGAG
output.fasta:
>58759__len__2903
TTTTCCGTAGAGGAGATCCCTATTTTTAGGTTTGTAAGAGATCATTTT
>673957__len__1655
CCCTATTTTTAGGTTTGTAAGGTTTGTAAGACCGTAGAG
>3566454__len__1744
GGTTTGTAAGACCGTAGAGGGTTTGTAAGACCGTAGAG
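Match the lines of list.txt (where lines are duplicated, use each unique line only once) against the seq.fasta FASTA file and extract the matching records, as shown in the output file.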
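The simple case you show is trivial. Your sequences never span more than one line, so you can simply use grep to search for each of your IDs together with the line that follows it: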
grep -Fwf list.txt -A 1 seq.fasta | grep -v '^--$' > out.fasta
The grep -v '^--$' simply filters out the -- lines that grep inserts between groups of output lines when the -A option is used.
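As an aside, GNU grep can suppress those separator lines itself. The following is a minimal sketch, assuming GNU grep (--no-group-separator is a GNU extension and may be missing from other implementations); the scratch directory and the workdir variable are only there to make the example self-contained, and the data is the sample from the question.

```shell
# Scratch directory holding the sample files from the question.
workdir=$(mktemp -d)
printf '%s\n' 58759__len__2903 58759__len__2903 673957__len__1655 \
    673957__len__1655 3566454__len__1744 > "$workdir/list.txt"
printf '%s\n' \
    '>58759__len__2903' TTTTCCGTAGAGGAGATCCCTATTTTTAGGTTTGTAAGAGATCATTTT \
    '>67777__len__2978' TTTTTAGGTTTGTAAGACCGTAGAG \
    '>673957__len__1655' CCCTATTTTTAGGTTTGTAAGGTTTGTAAGACCGTAGAG \
    '>3566454__len__1744' GGTTTGTAAGACCGTAGAGGGTTTGTAAGACCGTAGAG \
    > "$workdir/seq.fasta"

# GNU grep only: --no-group-separator drops the "--" lines that -A would add,
# so no second grep is needed to filter them out.
grep -Fwf "$workdir/list.txt" -A 1 --no-group-separator \
    "$workdir/seq.fasta" > "$workdir/out.fasta"
cat "$workdir/out.fasta"
```

If portability matters more than brevity, the two-step pipeline with grep -v '^--$' is probably the safer choice.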
To avoid duplicates, you can pass your list through (GNU) sort first:
grep -Fwf <(sort -u list.txt) -A 1 seq.fasta | grep -v '^--$' > out.fasta
The flags used are:
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file
    contains zero patterns, and therefore matches nothing. (-f is
    specified by POSIX.)

-w, --word-regexp
    Select only those lines containing matches that form whole
    words. The test is that the matching substring must either be
    at the beginning of the line, or preceded by a non-word
    constituent character. Similarly, it must be either at the end
    of the line or followed by a non-word constituent character.
    Word-constituent characters are letters, digits, and the
    underscore.

-F, --fixed-strings
    Interpret PATTERN as a list of fixed strings, separated by
    newlines, any of which is to be matched. (-F is specified by
    POSIX.)

-A NUM, --after-context=NUM
    Print NUM lines of trailing context after matching lines.
    Places a line containing a group separator (--) between
    contiguous groups of matches. With the -o or --only-matching
    option, this has no effect and a warning is given.
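To see why -w matters for IDs like these, here is a small sketch. The demo.fasta file and its second header are hypothetical, invented only to show an ID that contains another ID as a substring; workdir is just a scratch directory.

```shell
workdir=$(mktemp -d)
# Hypothetical records: the second header contains the first ID as a substring.
printf '%s\n' '>58759__len__2903' AAAA '>158759__len__2903' CCCC \
    > "$workdir/demo.fasta"

# Without -w the ID also matches inside the longer header: prints 2.
grep -c -F '58759__len__2903' "$workdir/demo.fasta"

# With -w the match must be bounded by non-word characters: prints 1.
grep -c -F -w '58759__len__2903' "$workdir/demo.fasta"
```

Note that -w is not a full anchor: since "." and spaces are non-word characters, an ID would still match a header that continues with something like ".1" or a description after it.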
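However, in most cases your sequences will span several lines, and this approach will not be enough. If you do this sort of thing often, I suggest you install the exonerate suite of tools. They are generally very useful for bioinformatics work and include a nice tool, fastafetch, which is designed to do exactly what you want: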
Install the exonerate suite. It is in the repositories of Debian-based systems and can also be obtained from here.
sudo apt-get install exonerate
Create an index for your FASTA file. This is used for fast retrieval of the sequences.
fastaindex seq.fasta seq.idx
Extract your sequences:
$ fastafetch -f seq.fasta -i seq.idx -Fq <(sort -u list.txt )
>3566454__len__1744
GGTTTGTAAGACCGTAGAGGGTTTGTAAGACCGTAGAG
>58759__len__2903
TTTTCCGTAGAGGAGATCCCTATTTTTAGGTTTGTAAGAGATCATTTT
>673957__len__1655
CCCTATTTTTAGGTTTGTAAGGTTTGTAAGACCGTAGAG
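If installing exonerate is not an option, plain awk can also cope with sequences that wrap over several lines. This is only a minimal sketch, not part of the exonerate tools: it assumes the ID is the first whitespace-delimited token after the ">" and that list.txt is not empty (the NR == FNR idiom misbehaves on an empty first file). The wrapped seq.fasta below is a hypothetical re-wrapping of the question's data, and workdir is just a scratch directory.

```shell
workdir=$(mktemp -d)
printf '%s\n' 58759__len__2903 58759__len__2903 673957__len__1655 \
    673957__len__1655 3566454__len__1744 > "$workdir/list.txt"
# Hypothetical multi-line version of the sequences from the question.
printf '%s\n' \
    '>58759__len__2903' TTTTCCGTAGAGGAGATCCCTATT TTTAGGTTTGTAAGAGATCATTTT \
    '>67777__len__2978' TTTTTAGGTTTGTAAGACCGTAGAG \
    '>673957__len__1655' CCCTATTTTTAGGTTTGTAAGG TTTGTAAGACCGTAGAG \
    '>3566454__len__1744' GGTTTGTAAGACCGTAGAGGGTTTGTAAGACCGTAGAG \
    > "$workdir/seq.fasta"

awk '
    # First file (list.txt): remember each non-empty ID; duplicates collapse.
    NR == FNR { if (NF) want[$1] = 1; next }
    # Header line: strip the ">" and decide whether this record is wanted.
    /^>/ { id = substr($1, 2); keep = (id in want) }
    # Print the header and every following line of a wanted record.
    keep
' "$workdir/list.txt" "$workdir/seq.fasta" > "$workdir/out.fasta"
cat "$workdir/out.fasta"
```

Unlike the fastafetch run above, this keeps the records in the order in which they appear in seq.fasta, and duplicate IDs in the list need no sort -u because they end up as a single array key.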