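I have a large text file with more than 100 million lines, named reads.fastq. I also have another file, takeThese.txt, which contains the line numbers (one per line) of the lines in reads.fastq that should be printed.
Currently I use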
awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq
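This takes a very long time. Is there a way to extract lines from a text file using line numbers stored in another file? Would it be faster if takeThese.txt were sorted?
EDIT:
Here are a few example lines from the files: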
reads.fastq:
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
takeThese.txt :
5
6
7
8
so that the output would look like this:
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
EDIT: Comparison of the suggested scripts:
$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq  > /dev/null
real    0m1.928s
user    0m0.819s
sys     0m1.100s
$ time ./karakfa  takeThese_numbered.txt reads_numbered.fastq  > /dev/null
real    0m8.334s
user    0m9.973s
sys     0m0.226s
$ time ./EdMorton takeThese.txt reads.fastq  > /dev/null
real    0m0.695s
user    0m0.553s
sys     0m0.130s
$ time ./ABrothers  takeThese.txt reads.fastq  > /dev/null
real    0m1.870s
user    0m1.676s
sys     0m0.186s
$ time ./GlenJackman takeThese.txt reads.fastq  > /dev/null
real    0m1.414s
user    0m1.277s
sys     0m0.147s
$ time ./DanielFischer takeThese.txt reads.fastq  > /dev/null
real    0m1.893s
user    0m1.744s
sys     0m0.138s
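Thank you for all the suggestions and efforts!
The script in your question will be extremely fast, since all it does is a hash lookup of the current line number in the array h. The following will be faster still, unless the last line of reads.fastq is among the lines you want, because it exits as soon as the last wanted line has been printed instead of reading the rest of reads.fastq: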
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
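You could add a delete h[FNR]; after the print; to shrink the array and so MAYBE speed up the lookups, but I don't know whether that would really improve performance: array access is a hash lookup and so already extremely fast, so adding a delete may end up slowing the script down overall.
Actually, the following will be faster still, since it avoids testing NR==FNR for every line of both files: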
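For reference, a minimal sketch of the delete variant just described, run on made-up data. The scratch directory and the file names reads.txt and take.txt are assumptions for illustration only, not files from the question, and nothing here measures whether the delete helps or hurts.

```shell
# Hypothetical demo files in a scratch directory; names are illustrative only.
tmpdir=$(mktemp -d)
printf '%s\n' a b c d e f > "$tmpdir/reads.txt"
printf '%s\n' 2 4 > "$tmpdir/take.txt"

# Early-exit one-liner with the matched key deleted right after printing.
result=$(awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; delete h[FNR]; if (!--c) exit}' \
    "$tmpdir/take.txt" "$tmpdir/reads.txt")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Note that the counter c counts lines of the line-number file rather than distinct keys, so if that file contained duplicates the early exit would never trigger and the whole input would be read.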
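awk -v nums='takeThese.txt' '
    # Load the wanted line numbers into h and count them in c.
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    # Print each wanted line; exit once all of them have been printed.
    NR in h{print; if (!--c) exit}
' reads.fastq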
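Whether that or the script @glennjackman posted is faster depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq no matter what takeThese.txt contains, it will run in roughly constant time, while mine will be significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example: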
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
.
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt
$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.720s
user    0m27.876s
sys     0m0.450s
$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m50.060s
user    0m47.564s
sys     0m0.405s
.
$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt
$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m26.738s
user    0m23.556s
sys     0m0.310s
$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.015s
sys     0m0.000s
But you can have the best of both worlds with:
$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.057s
user    0m26.675s
sys     0m0.498s
$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.030s
sys     0m0.062s
If we assume takeThese.txt is already sorted, this can be reduced to just:
$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m27.362s
user    0m25.599s
sys     0m0.280s
$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m0.047s
user    0m0.030s
sys     0m0.016s
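As a quick sanity check, the hash-based script and the sorted-input script above can be compared on a tiny input. This is only a sketch: the scratch directory, the file names reads.txt and take.txt, and the ten generated lines are assumptions for illustration. It checks that the outputs agree and says nothing about speed.

```shell
tmpdir=$(mktemp -d)
# Ten made-up lines standing in for reads.fastq.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "line" i }' > "$tmpdir/reads.txt"
# Wanted line numbers, already in ascending order with no duplicates.
printf '%s\n' 5 6 7 8 > "$tmpdir/take.txt"

# Hash lookup with early exit.
hashed=$(awk -v nums="$tmpdir/take.txt" '
    BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
    NR in h { print; if (!--c) exit }
' "$tmpdir/reads.txt")

# Sorted-input version: only the next wanted line number is held in memory.
sorted=$(awk -v nums="$tmpdir/take.txt" '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' "$tmpdir/reads.txt")

[ "$hashed" = "$sorted" ] && echo "outputs match"
rm -rf "$tmpdir"
```

The sorted-input version relies on the line numbers being strictly ascending. With a duplicate or out-of-order entry, linenum would fall behind NR and no further lines would be printed.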