Perl script faster than grep -f

Red*_*sey 1 perl grep

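I am adapting an existing Perl script suggested here: Fast alternative to grep -f

I need to filter many very large files (map files), each about 10 million lines long by 5 fields wide, using a long list (the filter file), and print the lines of the map file that match. I tried using grep -f, but it simply took too long. I read that this approach would be faster.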

This is what my files look like:

Filter file:

DB775P1:276:C2R0WACXX:2:1101:10000:77052
DB775P1:276:C2R0WACXX:2:1101:10003:51920
DB775P1:276:C2R0WACXX:2:1101:10004:36433
DB775P1:276:C2R0WACXX:2:1101:10004:57256

Map file:

DB775P1:276:C2R0WACXX:2:1101:10000:70401     chr5    21985760    21985780    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     + 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    + 
DB775P1:276:C2R0WACXX:2:1101:10002:88487     chr17   5681227     5681249     - 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr13   2569168     2569185     + 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr14   13253418    13253435    - 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr14   13266344    13266361    -

I want the output lines to look like this, since they contain a string found in both the map file and the filter file.

DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     + 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    + 

Here is the script as I have edited it so far, but with no luck:

#!/usr/bin/env perl
use strict;
use warnings;

# Load the files
my $filter = $ARGV[0];
my $sam = $ARGV[1];
open FILE1, $filter;
   if (! open FILE1, $filter) {die "Can't open filterfile: $!";}
open FILE2, $sam;
   if (! open FILE2, $sam) {die "Can't open samfile: $!";}

# build hash of keys using lines from the filter file
my $lines;
my %keys
while (<FILE1>) {
   chomp $lines;
   %keys = $lines;
}
close FILE1;

# look up keys in the map file, if match, print line in the map file.
my $samlines;
while (<FILE2>) {
   chomp $samlines;
   my ($id, $chr, $start, $stop, $strand)  = split /\t/, $samline;
   if (defined $lines->{$id}) { print "$samline \n"; }
}

Bor*_*din 5

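You don't seem to have really tried to solve this yourself. The code you show won't even compile

There are several reasons why it doesn't work

  • You are using file read loops with an implicit control variable, which read each line into $_, but you somehow expect the data to appear in $lines and $samlines. You are also using $samline, which you haven't even declared

  • This line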

    my %keys
    

    needs a semicolon at the end

  • I don't know what you expect $lines to hold, but assigning a scalar value to a hash like this

    %keys = $lines;
    

    produces the warning "Odd number of elements in hash assignment" and leaves you with a hash that has only one element
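To make the three points above concrete, here is a minimal sketch of a read loop that avoids them. It is only an illustration: the in-memory string `$filter_text` is an assumption standing in for the real filter file, and the variable names are not taken from the original script.

```perl
use strict;
use warnings;

# Assumption: two sample IDs held in a string instead of a real filter file.
my $filter_text = "DB775P1:276:C2R0WACXX:2:1101:10000:77052\n"
                . "DB775P1:276:C2R0WACXX:2:1101:10003:51920\n";

# Open the string as an in-memory file so the sketch is self-contained.
open my $fh, '<', \$filter_text or die "Can't open in-memory file: $!";

my %keys;                        # declaration ends with a semicolon
while ( my $line = <$fh> ) {     # read into a named variable rather than $_
    chomp $line;                 # chomp the variable that actually holds the line
    $keys{$line} = 1;            # add one key per line; don't assign a scalar to the whole hash
}
close $fh;

printf "%d keys loaded\n", scalar keys %keys;
```

With the hash built this way, a later lookup is a plain `$keys{$id}` test on the first field of each map-file line.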

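Here is a Perl program that does what I think you intend, but I can't say whether it will be much faster than command-line grep. Note that I have used the autodie pragma instead of explicitly testing the status of every file IO operation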

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10.1;
use autodie;

my ($filter_f, $sam_f) = @ARGV;

my %filter;

{
    open my $fh, '<', $filter_f;

    while ( <$fh> ) {
        $filter{$1} = 1 if /(\S+)/;
    }
}

open my $fh, '<', $sam_f;

while ( <$fh> ) {
    print if /(\S+)/ and $filter{$1};
}

output

DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     +
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    +
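As a side note on the autodie pragma used in the program, the following sketch contrasts it with an explicit status check. The path in `$missing` is hypothetical and is assumed not to exist on the machine running the code; the flag variables are for illustration only.

```perl
use strict;
use warnings;

# Assumption: this path does not exist, so both opens fail.
my $missing = '/nonexistent/path/filter.txt';

# Without autodie: open returns false on failure, so the caller must test it.
my $explicit_failed = 0;
open( my $fh1, '<', $missing ) or $explicit_failed = 1;

# With autodie in lexical scope: a failing open throws an exception instead.
my $autodie_failed = 0;
{
    use autodie;
    eval { open( my $fh2, '<', $missing ); 1 } or $autodie_failed = 1;
}

print "explicit check failed: $explicit_failed\n";
print "autodie raised:        $autodie_failed\n";
```

In a script like the one in the answer there is no `eval`, so the exception simply ends the program with a message naming the file and the error, which is roughly what a hand-written `or die` would do.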