我在perl中开发了一个比较器工具,它从2个CSV文件中获取数据,并根据一组唯一键进行比较,并生成一个excel报告.只要我有20k-30k行,每个CSV文件有10列,这个工具工作得很好,但比较算法所花费的时间过高,目前我在每个CSV文件中运行98k行并且它已经被现在跑了4个多小时,我觉得还有更长的时间.
我使用的算法是:
以下是我的比较代码:
# Actual comparison
# Matched rows will go to @finalMatchedArray1 and @finalMatchedArray2
# Unmatched arrays will stay in the original arrays
for ( my $j=0; $j<=$#matchArray1; $j++ ) {
my @result = map {$_}
grep { "@{ $matchArray2[$_] }[@mapKeyArray2]" eq "@{ $matchArray1[$j] }[@mapKeyArray1]" }
0 .. $#matchArray2;
if ( defined $result[0] ) {
push @finalMatchedArray1,$matchArray1[$j];
push @finalMatchedArray2,$matchArray2[$result[0]];
splice ( @matchArray2,$result[0],1 );
splice ( @matchArray1,$j,1 );
$j--;
}
}
Run Code Online (Sandbox Code Playgroud)
以下是数据:
$matchArray1 = [[qw(a b c)], [qw(d e f)], [qw(g h i)]];
$matchArray2 = [[qw(d e f)], [qw(g k i)], [qw(a b c)]];
$mapKeyArray1 = [1,2];
$mapKeyArray2 = [1,2]; #Can be different in real example
Run Code Online (Sandbox Code Playgroud)
这个代码适用于少于20k行的完美正常,我需要在更高的容量情况下提高性能(在这种情况下每个CSV中98k行)
请提供任何指针,代码更改可能会帮助我加快匹配过程.
您正在搜索另一个数组的每个元素的一个数组的每个元素.对于可以在O(N)中完成的事情来说,这是O(N 2).
一般来说,
for my $ele_a (@a) {
for my $ele_b (@b) {
if (generate_key($ele_a) eq generate_key($ele_b)) {
...
}
}
}
Run Code Online (Sandbox Code Playgroud)
可写成
my %b;
for my $ele_b (@b) {
$b{ generate_key($ele_b) } = $ele_b;
}
for my $ele_a (@a) {
if (exists($b{generate_key($ele_b)}) {
if (generate_key($ele_a) eq generate_key($ele_b)) {
...
}
}
}
Run Code Online (Sandbox Code Playgroud)
我们可以在这里申请.
my @file1_key_idxs = ...; # Indexes of key fields.
my @file2_key_idxs = ...; # Indexes of key fields.
my @file1_data_idxs = ...; # Indexes of fields to compare.
my @file2_data_idxs = ...; # Indexes of fields to compare.
my @matches; # Array of [ \@keys, \@data ]
my @non_matches; # Array of [ \@keys, \@file1_data, \@file2_data ]
my @file1_adds; # Array of rows from file1 not in file2.
my @file2_adds; # Array of rows from file2 not in file1.
open(my $fh1, '<:encoding(...)', ...) or die $!;
open(my $fh2, '<:encoding(...)', ...) or die $!;
my $csv = Text::CSV_XS->new({ binary => 1 });
my %file2;
while ( my $file2_row = $csv->getline($fh2) ) {
my @key = @{ $file2_row }[ @file2_key_idxs ];
my $key = pack("(N/a*)*", @key;
$file2{$key} = $file2_row;
}
Run Code Online (Sandbox Code Playgroud)
while ( my $file1_row = $csv->getline($fh1) ) {
my @key = @{ $file1_row }[ @file1_key_idxs ];
my $key = pack("(N/a*)*", @key);
if (defined(my $file2_row = $file2{$key})) {
my @file1_data = @{ $file1_row }[ @file1_data_idxs ];
my $file1_data = pack("(N/a*)*", @file1_data);
my @file2_data = @{ $file2_row }[ @file2_data_idxs ];
my $file2_data = pack("(N/a*)*", @file2_data);
if ($file1_data eq $file2_data) {
push @matches, [ \@key, \@file1_data ];
} else {
push @non_matches, [ \@key, \@file1_data, \@file2_data ];
}
delete $file2{$key};
} else {
push @file1_adds, $file1_row;
}
}
while (my $key = each(%file2)) {
my $file2_row = delete($file2{$key});
push @file2_adds, $file2_row;
}
Run Code Online (Sandbox Code Playgroud)
如果其中一个文件比另一个文件小,那么它应该是file2以节省内存并加快速度.
如果内存不是问题,可以使用以下内容替换最后一个循环以加快速度:
@file2_adds = values(%file2);
Run Code Online (Sandbox Code Playgroud)