Regex does not match if the order of items in the file changes

Question

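This is my first attempt at Perl, so I know this code is ugly. Some of that comes from not knowing what I'm doing, and some from troubleshooting various problems. What I'm trying to do is search a file (samplefile.txt) for various pieces of information (the nine parse_updates functions), and it works fine unless the order changes. For example, if a sample file has the Certificate Bundle before the Botnet Definitions, it fails to find the Certificate Bundle information. I expected each function to start searching the sample file "fresh", but that doesn't seem to be the case and I don't know why. I haven't included the sample file because the post is already long enough with the code, and I think the problem is in my function logic.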


use strict;
use warnings;
use diagnostics;
use File::Slurp;
my @autoupdate;
my $autoupdate;

my $av_regex;
my @av_updates;

my $avdev_regex;
my @avdef_updates;

my $ipsatt_regex;
my @ipsatt_updates;

my $attdef_regex;
my @attdef_updates;


my $ipsmal_regex;
my @ipsmal_updates;

my $flowav_regex;
my @flowav_updates;

my $botnet_regex;
my @botnet_updates;

my $appdef_regex;
my @appdef_updates;

my $ipgeo_regex;
my @ipgeo_updates;

my $certbun_regex;
my @certbun_updates;

my $str1;
my $str2;
my $str3;
my $str4;
my $str5;
my $str6;
my $str7;
my $str8;
my $str9;


 
parse_updates1(); #AV Engine
parse_updates2(); #Virus Defs
parse_updates3(); #IPS Attack Engine
parse_updates4(); #Attack Defs
parse_updates5(); #IPS Mal URL DB
parse_updates6(); #Flow virus Defs
parse_updates7(); #Botnet Defs
parse_updates8(); #IP Geo DB
parse_updates9(); #Cert Bundle


sub parse_updates1{
print "\nTHIS IS AV Engine Section!!\n\n";
read_file('samplefile.txt', buf_ref => \$str1);


my $av_regex =qr/(AV Engine)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;
if ( $str1 =~ /$av_regex/g ) {
  #putting each regex group into the array
  push @av_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds
  chomp @av_updates;

     print "$_\n" for @av_updates;

}
else {
  print "\n\nGot Nothing!\n\n";
  @av_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @av_updates;

}
}
sub parse_updates2{
read_file('samplefile.txt', buf_ref => \$str2);

print "\nTHIS IS Virus Definitions Section!!\n\n";

my $avdef_regex =qr/(Application Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str2 =~ /$avdef_regex/g ) {
  #putting each regex group into the array
  push @avdef_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @avdef_updates;
 
print "$_\n" for @avdef_updates;

}
else {
  print "\n\nGot Nothing!\n\n";
  @avdef_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @avdef_updates;


}
}
sub parse_updates3{
read_file('samplefile.txt', buf_ref => \$str3);

 print "\nTHIS IS IPS Attack Engine Section!!\n\n";

my $ipsatt_regex =qr/(IPS Attack Engine)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str3 =~ /$ipsatt_regex/g ) {
  #putting each regex group into the array
  push @ipsatt_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipsatt_updates;
 

print "$_\n" for @ipsatt_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipsatt_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipsatt_updates;

}
}
sub parse_updates4{
read_file('samplefile.txt', buf_ref => \$str4);

 print "\nTHIS IS Attack Definitions Section!!\n\n";

my $attdef_regex =qr/(Attack Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str4 =~ /$attdef_regex/g ) {
  #putting each regex group into the array
  push @attdef_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @attdef_updates;
 

print "$_\n" for @attdef_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @attdef_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @attdef_updates;

}
}
sub parse_updates5{
read_file('samplefile.txt', buf_ref => \$str5);

print "\nTHIS IS IPS Malicious URL Database Section!!\n\n";

my $ipsmal_regex =qr/(IPS Malicious URL Database)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str5 =~ /$ipsmal_regex/g ) {
  #putting each regex group into the array
  push @ipsmal_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipsmal_updates;
 

print "$_\n" for @ipsmal_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipsatt_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipsatt_updates;

}
}
sub parse_updates6{
read_file('samplefile.txt', buf_ref => \$str6);

print "\nTHIS IS Flow-Based Virus Definitions Section!!\n\n";

my $flowav_regex =qr/(Flow-based Virus Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str6 =~ /$flowav_regex/g ) {
  #putting each regex group into the array
  push @flowav_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @flowav_updates;
 

print "$_\n" for @flowav_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @flowav_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @flowav_updates;

}
}
sub parse_updates7{
read_file('samplefile.txt', buf_ref => \$str7);

print "\nTHIS IS Botnet Definitions Section!!\n\n";

my $botnet_regex =qr/(Botnet Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str7 =~ /$botnet_regex/g ) {
  #putting each regex group into the array
  push @botnet_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @botnet_updates;
 

print "$_\n" for @botnet_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @botnet_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @botnet_updates;

}
}
sub parse_updates8{
read_file('samplefile.txt', buf_ref => \$str8);


print "\nTHIS IS IP geography DB Section!!\n\n";

my $ipgeo_regex =qr/(IP Geography DB)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str8 =~ /$ipgeo_regex/g ) {
  #putting each regex group into the array
  push @ipgeo_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipgeo_updates;
 

print "$_\n" for @ipgeo_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipgeo_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipgeo_updates;

}
}
sub parse_updates9{
read_file('samplefile.txt', buf_ref => \$str9);


print "\nTHIS IS Certificate Bundle Section!!\n\n";

my $certbun_regex =qr/(Certificate Bundle)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str9 =~ /$certbun_regex/g ) {
  #putting each regex group into the array
  push @certbun_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @certbun_updates;
 

print "$_\n" for @certbun_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @certbun_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @certbun_updates;

}


# End of sub parse_updates
}


Answer 1

zdi*_*dim 5

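While this can't be answered definitively without seeing some data, I'd like to first offer a rewrite of the program. That may also fix the problem.

There is no reason for all those functions; they all do exactly the same thing. Nor is there a need for that sea of variables; a hash is well suited to a collection of named things. I've kept at least some of the original choices, like the overall flow, the use of File::Slurp, etc.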

use warnings;
use strict;
use feature 'say';    

use Data::Dump qw(dd);
use File::Slurp;

my $fname = shift // die "Usage: $0 file\n";   #/

my %update = (
    av => { 
        re => qr/pattern-for-av/,
        name => q(AV Engine Section),
    },
    avdev => { 
        re => qr/pattern-for-avdev/, 
        name => q(Virus Definitions Section),
    },
    # ...
);

my $file_content = read_file($fname);

foreach my $code (sort keys %update) {
    say "This is $update{$code}{name}";
    my $captures = parse_update( $file_content, $update{$code}{re} );
    $update{$code}{captures} = $captures;
}    
dd \%update;

sub parse_update {
    my ($file_content, $re) = @_;

    my @captures = $file_content =~ /$re/;  
    if (not @captures) {
        say "Got nothing!";
        @captures = ( 'notfound' ) x 12;  # apparently exactly 12
    }
    else { chomp @captures }

    say for @captures;  

    return \@captures;
}


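The regex patterns and the section names are all in the hash %update, and the results (captures) are then added to it. This choice of data organization is somewhat arbitrary since I don't know the context.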
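Since the nine patterns in the question differ only in the section title, the hash could be generated rather than typed out. What follows is a minimal sketch, assuming every section has the same field layout as in the question. The helper make_section_re and the short codes are hypothetical names, and only three of the nine titles are listed.

```perl
use strict;
use warnings;

# Hypothetical short codes mapped to section titles taken from the question.
my %title = (
    av      => 'AV Engine',
    botnet  => 'Botnet Definitions',
    certbun => 'Certificate Bundle',
);

# Build one regex per title; only the title differs between sections.
# \Q...\E quotes the title so any regex metacharacters in it are literal.
sub make_section_re {
    my ($title) = @_;
    return qr/(\Q$title\E)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n)/;
}

# Same shape as %update above: a name and a pattern per section code.
my %update;
for my $code (keys %title) {
    $update{$code} = {
        name => "$title{$code} Section",
        re   => make_section_re( $title{$code} ),
    };
}
```

The rest of the program can then loop over %update as before, and adding a section is a one-line change to %title.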

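The file is read once, and its whole content is copied into the sub on each call. Adjust as needed; for example, if the file is large there are other ways to make that data available to the sub.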
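One such way is to pass a reference to the string rather than the string itself. This is a sketch under that assumption: parse_update_ref is a hypothetical variant of the sub above, and the data is made up.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical variant of parse_update() that takes a reference to the file
# content, so the (possibly large) string is not copied on each call.
sub parse_update_ref {
    my ($content_ref, $re) = @_;

    # No /g here: every call scans the string from the beginning.
    my @captures = $$content_ref =~ /$re/;
    chomp @captures;
    return \@captures;
}

# Made-up data standing in for the file content.
my $file_content = "Version: 1.2\nResult: ok\n";

my $captures = parse_update_ref( \$file_content, qr/Version: (.*\n)/ );
say for @$captures;
```

The caller keeps ownership of the string, and the sub still receives everything it needs as arguments.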

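The if (/.../g) used in the question, which one sees occasionally, is pointless and prone to subtle errors; it may well cause exactly the kind of problem described in the question.^† In scalar context the /g modifier serves more complex needs and is not meant for a standalone if statement.

The condition for a successful match (and thus for the captures) is taken from the question. The code in the sub can be organized in a number of other ways, from more compact to more elaborate.

Note that the sub doesn't directly use anything from the enclosing scope; everything it needs is passed to it explicitly, and it returns its result. This matters because it avoids coupling code components that are meant to be distinct (here the sub and its caller); they could even reside in different compilation units.

This rewrite may well have caught the error and fixed the problem; or it may not have. More targeted troubleshooting would be possible if we could see a sample of the data.

The code above has been tested, with a made-up file and suitable regex patterns.

^† While I'd need to see some data to determine what causes the reported behavior, a good candidate is the unsuspecting use of if (/.../g). That modifier makes the regex remember the position in the string where it matched, and the next time a regex with /g is applied to the same string it starts looking for a match from that position, where the previous match ended.

A simple example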

use warnings; use strict; use feature 'say';

my $s = q(one simple string); 

if ($s =~ /(\w+)/g) { say $1 }; 
if ($s =~ /(\w+)/g) { say $1 }; 
say pos($s);


which prints

one
simple
10

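where the last line is the position in the string that the regex engine has tracked up to that point, right after the second match. (The pos function is handy for seeing some of what goes on in regex operations.) So when invoked again after a match, the regex continues from where it left off, courtesy of the /g modifier; without it, the string is scanned from the beginning on each new invocation.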
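For comparison, here is a small sketch of two ways to get a fresh scan on every attempt: dropping /g, or keeping it and clearing the stored position by hand by assigning undef to pos($s). The array names are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $s = q(one simple string);

# Without /g each match starts at the beginning of the string.
my @plain;
for (1..2) { push @plain, $1 if $s =~ /(\w+)/ }
say "@plain";

# With /g the position carries over, unless it is cleared in between.
my @reset;
for (1..2) {
    push @reset, $1 if $s =~ /(\w+)/g;
    pos($s) = undef;    # forget where the last /g match ended
}
say "@reset";
```

Both lines print "one one", since each match attempt begins at the start of the string.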

Another example, with a single expression executed repeatedly

use warnings; use strict; use feature 'say';

my $s = q(one two);

sub func { say $1 if $_[0] =~ /(\w+)/g };  # /g is of consequence!

for (1..4) { func($s) }


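This prints

one
two
one

The third call prints nothing: the engine went past the word two in the second match, so on the next iteration of the for loop there is nothing left to match. That failed match then resets the position, so the fourth call starts from the beginning of the string and finds one again.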

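For more on the above examples and their context, see this post and this post.

The second example in particular is very similar to what is given in the question.

Some of the behavior above can be modified with anchors and other modifiers. The /g modifier is of course useful, but one needs to know what it does.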
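As an illustration of the kind of job scalar-context /g is suited for, here is a small tokenizer sketch. It relies on the \G anchor (match only at the current position) and on the /c modifier (keep the position when a match fails). The token names and the input are made up.

```perl
use strict;
use warnings;
use feature 'say';

my $input = q(x = 42 + y);
my @tokens;

# Walk the string once; each branch resumes exactly where the previous
# match ended (\G), and /c keeps pos() intact when a branch fails.
# The loop ends at the end of the string or at an unrecognized character.
while (1) {
    if    ($input =~ /\G\s+/gc)    { next }
    elsif ($input =~ /\G(\d+)/gc)  { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G(\w+)/gc)  { push @tokens, "ID($1)" }
    elsif ($input =~ /\G([=+])/gc) { push @tokens, "OP($1)" }
    else                           { last }
}
say "@tokens";
```

Here the remembered position is the whole point, which is the opposite of a one-off if test where it only gets in the way.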
