Print many specific lines from a text file using an index file

Question


Dan*_*her 6 unix bash awk sed

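I have a large text file with more than 100 million lines, named reads.fastq. I also have another file, takeThese.txt, which contains the line numbers (one per line) of the lines in reads.fastq that should be printed.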

Currently I use

awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq

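which obviously takes a very long time. Is there a faster way to extract lines from a text file using line numbers stored in another file? Would it speed things up if takeThese.txt were sorted?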

EDIT:

A few example lines from the files:

reads.fastq:

@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA


takeThese.txt :


so that the output looks like this:

@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
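For anyone who wants to reproduce the example, here is a minimal sketch. The contents of takeThese.txt are not shown above, so the values 5 to 8 below are an assumption inferred from the expected output (the second FASTQ record); the scratch directory and the variable name tmpdir are likewise only for illustration.

```shell
# Use a scratch directory so no real data is touched.
tmpdir=$(mktemp -d)

# The ten sample lines of reads.fastq from the question.
cat > "$tmpdir/reads.fastq" <<'EOF'
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
EOF

# ASSUMPTION: line numbers 5-8, inferred from the expected output.
printf '%s\n' 5 6 7 8 > "$tmpdir/takeThese.txt"

# The command from the question: load the wanted line numbers into h,
# then print every line of reads.fastq whose line number is a key of h.
awk 'FNR == NR { h[$1]; next } (FNR in h)' \
    "$tmpdir/takeThese.txt" "$tmpdir/reads.fastq" > "$tmpdir/subsample.fastq"

cat "$tmpdir/subsample.fastq"
```

With those assumed line numbers, subsample.fastq should contain the four-line record shown above.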


EDIT: Comparison of the suggested scripts:

$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq  > /dev/null

real    0m1.928s
user    0m0.819s
sys     0m1.100s

$ time ./karakfa  takeThese_numbered.txt reads_numbered.fastq  > /dev/null

real    0m8.334s
user    0m9.973s
sys     0m0.226s

$ time ./EdMorton takeThese.txt reads.fastq  > /dev/null

real    0m0.695s
user    0m0.553s
sys     0m0.130s

$ time ./ABrothers  takeThese.txt reads.fastq  > /dev/null

real    0m1.870s
user    0m1.676s
sys     0m0.186s

$ time ./GlenJackman takeThese.txt reads.fastq  > /dev/null

real    0m1.414s
user    0m1.277s
sys     0m0.147s

$ time ./DanielFischer takeThese.txt reads.fastq  > /dev/null

real    0m1.893s
user    0m1.744s
sys     0m0.138s


Thank you for all your suggestions and efforts!

Answer 1

Ed *_*ton 5

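The script in your question is already very fast, since all it does for each line is a hash lookup of the current line number in the array h. The following will be faster still, unless the last line of reads.fastq is among the lines you want, because it exits as soon as the last requested line has been printed instead of reading the rest of reads.fastq: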

awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq


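You could add a delete h[FNR]; after the print; to shrink the array as you go, which MIGHT speed up the lookups. I don't know whether that would really improve performance, though: array access is a hash lookup and is already extremely fast, so the extra delete may end up slowing the script down overall.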
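A minimal sketch of that delete variant, run on small generated files so the behaviour is easy to check. The file names reads.txt and take.txt, the directory variable del_dir, and the 20-line input are made up for this illustration; whether the delete helps or hurts on 100 million lines would have to be measured.

```shell
del_dir=$(mktemp -d)

# Hypothetical 20-line input and three requested line numbers.
awk 'BEGIN { for (i = 1; i <= 20; i++) print "line" i }' > "$del_dir/reads.txt"
printf '%s\n' 3 7 11 > "$del_dir/take.txt"

result=$(awk '
    # First file: remember each wanted line number and count them.
    FNR == NR { h[$1]; c++; next }
    # Second file: print a wanted line, drop its key from the array,
    # and stop reading once every wanted line has been printed.
    FNR in h  { print; delete h[FNR]; if (!--c) exit }
' "$del_dir/take.txt" "$del_dir/reads.txt")

printf '%s\n' "$result"
# prints line3, line7 and line11, one per line
```

Note that the counter c assumes the line numbers in the index file are unique; a duplicated number would be counted twice but matched only once, so the early exit would never trigger.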

Actually, this will be faster still, since it avoids testing NR == FNR for every line in both files:

awk -v nums='takeThese.txt' '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq


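Whether that is faster than the script @glennjackman posted depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's script reads the whole of reads.fastq no matter what takeThese.txt contains, it runs in roughly constant time, while mine gets significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example: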

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq


.

$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m50.060s
user    0m47.564s
sys     0m0.405s


.

$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.015s
sys     0m0.000s


But you can get the best of both worlds:

$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.057s
user    0m26.675s
sys     0m0.498s


$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.030s
sys     0m0.062s


which, if we assume takeThese.txt is already sorted, can be reduced to:

$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m0.047s
user    0m0.030s
sys     0m0.016s
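The reduced version relies on the index file being in ascending numeric order, and it also appears to need unique line numbers: after a duplicate, NR would never equal the repeated number again, so the remaining lines would be skipped. A hedged sketch of one way to check this first, using the POSIX sort -c (check order) and -u (check for duplicate keys when combined with -c) options. The file names nums.txt, nums.sorted and reads.txt and the variable demo_dir are invented for this example.

```shell
demo_dir=$(mktemp -d)

# Hypothetical 20-line input and a deliberately unsorted index file.
awk 'BEGIN { for (i = 1; i <= 20; i++) print "line" i }' > "$demo_dir/reads.txt"
printf '%s\n' 11 3 7 > "$demo_dir/nums.txt"

# sort -c only checks the order and exits non-zero on failure;
# with -u it also rejects duplicate keys. Sort only when needed.
if sort -n -u -c "$demo_dir/nums.txt" 2>/dev/null; then
    cp "$demo_dir/nums.txt" "$demo_dir/nums.sorted"
else
    sort -n -u "$demo_dir/nums.txt" > "$demo_dir/nums.sorted"
fi

picked=$(awk -v nums="$demo_dir/nums.sorted" '
    # Read the first wanted line number; stop right away if there is none.
    BEGIN { if ((getline linenum < nums) < 1) exit }
    # Print the wanted line, then fetch the next number or exit when done.
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' "$demo_dir/reads.txt")

printf '%s\n' "$picked"
# prints line3, line7 and line11, one per line
```

For an index file with millions of entries the check itself costs a pass over the file, so it is only worth doing when the ordering is in doubt.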


Archived: 9 years, 1 month ago
Viewed: 226 times
Last activity: 9 years ago