为什么我的Perl正则表达式这么慢?

sno*_*kin 2 regex perl

我有以下正则表达式:

my $scores_compiled_regex  = qr{^0
                                  \s+
                                  (\p{Alpha}+\d*)
                                  \s+
                                  (\d+
                                  \s*
                                   \p{Alpha}*)
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}                              
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s+
                                   \d+ #$
                                   }xos
Run Code Online (Sandbox Code Playgroud)

;

它应匹配这些行(来自普通的txt文件):

0            AAS  211    1   1       5       2   6                                                                         15
Run Code Online (Sandbox Code Playgroud)

列名是:

0 INST, NAME             A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS
Run Code Online (Sandbox Code Playgroud)

它意味着:分数A = 1,分数A- = 1,无分数B +,分数B = 5等.我正在尝试将其拆分为列表,而不是忽略空列,它可以工作,但速度很慢,匹配也很慢,我的意思是慢,超过5秒,有时甚至更多!

文件中的前几个文件如下所示:

0 PALMER, JAN            A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS   TOTAL
0            ECON 103   98      35 114   1  14  75           9      35               1          10       1                     
Run Code Online (Sandbox Code Playgroud)

分数是跟随右侧A列的任何内容.

任何的想法?谢谢,

mfo*_*ani 5

看我的计划:

use strict;
use warnings;

# Column details and sample line, from the post
my $header  = q{0 AOZSVIN, TAMSSZ B      A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS};
my $sample  = q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     };
#               -+--------+-----+-----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---..
# chars         1212345678912345612345612341234123412341234123412341234123412341234123412341234123412341234123412341234123412341234...
# num. chars:   2 9        6     6     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   *
my $unpack  = q{A2A9       A6    A6    A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A*};
$unpack =~ s/\s//g;

# Get column names from the "$header" variable above
my @column_names = unpack($unpack, $header);
s/\s+$// for @column_names; # get rid of trailing spaces
s/^\s+// for @column_names; # get rid of leading spaces

# Some sample data in same format, to try the script out
my @samples = (
  q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     },
  q{0            AAS  353    2   3   5   2   6       1                   2                                                     },
  q{0            T304 480M   3  10   8   8   2   3   2   1                                               1               1    },
  q{0            BIOS 206    3  14   5  11   9   8   4   8   3   1   1   6                           7                      },
);

my @big_sample = (@samples) ;#x 200_000;

my @unpacked_data_as_arrayrefs;
m    y @unpacked_data_as_hashrefs;
my $begin = time;
for my $line ( @big_sample ) {
    my @data = unpack($unpack,$line);
    s/\s+$// for @data; # get rid of trailing spaces
    s/^\s+// for @data; # get rid of leading spaces
    push @unpacked_data_as_arrayrefs, [@data]; # stop here if this is all you need
    ## below converts the data in a hash, based on the column names given
    #my %as_hash;
    #for ( 0..$#column_names ) {
    #    $as_hash{ $column_names[$_] } = $data[$_];
    #}
    #push @unpacked_data_as_hashrefs, { %as_hash };
}
my $tot = time - $begin;
print "Done in $tot seconds\n";

# verify all data is as we expected
# uncomment the ones that test hashref, if the above hashref-building code is also uncommented.
{
    use Test::More;
    # first sample
    is($unpacked_data_as_arrayrefs[0]->[2],'AAS'); # AAS in the third column
    is($unpacked_data_as_arrayrefs[0]->[7],'35');  # 35 in the 8th column
    # fourth sample
    is($unpacked_data_as_arrayrefs[3]->[2],'BIOS');
    is($unpacked_data_as_arrayrefs[3]->[15],'6');
    # sixth
    is($unpacked_data_as_arrayrefs[5]->[7],'114');
    is($unpacked_data_as_arrayrefs[5]->[10],'75');
    done_testing();
}
Run Code Online (Sandbox Code Playgroud)

它使用unpack根据字符串中字段的宽度(以字符为单位)将文本拆分为多个块.另请参阅perlpacktut以获取有关如何使用unpack进行此类字符串重整的更多详细信息.对于这种格式,解包可能是最好的,因为它与正则表达式相比表现得非常快(在我的机器上在约6秒内解析600_000个这样的字符串).

如果您需要了解该计划的任何部分,请告诉我.我没有在这里发布,因为它有点偏长(更好的评论而不是!).请告诉我你是否愿意.