How to remove duplicate sequences from a FASTA file

Jam*_*mes · 2 · tags: bioinformatics, biological-neural-network, biopython

I am trying to build a database of all published bacterial genome sequences so that I can map my reads against it with bowtie2 and calculate their coverage. To do this, I merged all the genome sequences I downloaded from NCBI into a single FASTA library (I concatenated 74 files into one FASTA file). The problem is that this FASTA file (the library I created) contains many duplicated sequences, which strongly distorts the coverage. So I am asking: is there any way to remove the duplicates from my library file, or to merge the sequences without duplicates in the first place, or any other way to calculate the coverage of my reads against the reference sequences?
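For context, the workflow is roughly this (a sketch with placeholder file names):

$ cat genome_*.fasta > fasta_library.fa
$ bowtie2-build fasta_library.fa fasta_library_index
$ bowtie2 -x fasta_library_index -U reads.fq -S mapped.sam
$ samtools sort -o mapped.bam mapped.sam
$ samtools depth mapped.bam > coverage.txt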

I hope I was clear enough; please let me know if anything is unclear.

Ale*_*lds · 6

If you have control over your setup, then you could install seqkit and run the following on your FASTA file:

$ seqkit rmdup -s < in.fa > out.fa
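If seqkit isn't already available, it can usually be installed through Bioconda (a sketch, assuming conda is configured with the bioconda channel):

$ conda install -c bioconda seqkit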

If you have multiple files, you can concatenate them and feed them in as standard input:

$ seqkit rmdup -s < <(cat inA.fa ... inN.fa) > out.fa

The rmdup subcommand removes duplicates, and the -s flag identifies duplicates by sequence, ignoring differences in headers. I'm not sure which header is kept in the output, but that may be something to think about.
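If you want to inspect what was collapsed, seqkit rmdup can also write the duplicates to separate files; the -d and -D flags below are from the version I'm familiar with, so check seqkit rmdup --help on yours:

$ seqkit rmdup -s -d dup_seqs.fa -D dup_ids.txt < in.fa > out.fa

Here dup_seqs.fa receives the duplicated sequences and dup_ids.txt the count and IDs of each duplicated group.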

To avoid third-party dependencies, and to see exactly how the duplicates are being removed, one can use awk.

The idea is to read the FASTA records one by one, storing each record in an associative array (a hash table, called a "dictionary" in Python) only if its sequence is not already in the array.

For example, starting with a single-line FASTA file in.fa that looks like this:

>test1
ATAT
>test2
CGCG
>test3
ATAT
>test4
GCCT

We can remove duplicates, preserving the first header, like so:

$ awk 'BEGIN {i = 1;} { if ($0 ~ /^>/) { tmp = h[i]; h[i] = $0; } else if (!a[$1]) { s[i] = $1; a[$1] = "1"; i++; } else { h[i] = tmp; } } END { for (j = 1; j < i; j++) { print h[j]; print s[j]; } }' < in.fa > out.fa
$ cat out.fa
>test1
ATAT
>test2
CGCG
>test4
GCCT

Modifying it requires a little knowledge of awk. This approach also depends on how your FASTA files are structured (sequences on a single line or wrapped over multiple lines, etc.), though it is usually easy to convert FASTA files into the structure above (one line each for header and sequence); see the sketch below.
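For example, this awk one-liner (a sketch; multi.fa and single.fa are placeholder names) collapses wrapped sequence lines so that each record takes exactly two lines:

$ awk '/^>/ { if (seq) print seq; print; seq = ""; next } { seq = seq $0 } END { if (seq) print seq }' < multi.fa > single.fa

The bare print in the header block prints the header line itself; the accumulated sequence of the previous record is flushed just before it, and once more at end of input.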

Any hash table approach also uses a fair bit of memory (I imagine that seqkit probably makes the same compromise for this particular task, but I haven't looked at the source). This could be an issue for very large FASTA files.

It's probably better to use seqkit if you have a local environment where you can install software. If your setup is locked down by IT, then awk will also do the job, as it comes with most Unix systems out of the box.