Unix / Perl / Python: substitution list on a large data set

Sub*_*beh 6 python perl awk sed large-data

I've got a mapping file of about 13,491 key/value pairs which I need to use to replace the key with the value in a data set of about 500,000 lines divided over 25 different files.

Example mapping: value1,value2

Example input: field1,field2,**value1**,field4

Example output: field1,field2,**value2**,field4

Please note that the value could be in different places on the line, with more than one occurrence.

My current approach is with AWK:

awk -F, 'NR==FNR { a[$1]=$2 ; next } { for (i in a) gsub(i, a[i]); print }' mapping.txt file1.txt > file1_mapped.txt

However, this is taking a very long time.

Is there any other way to make this faster? I could use a variety of tools (Unix, AWK, Sed, Perl, Python, etc.).

zdi*_*dim 6

Note: See the second part for a version that uses the Text::CSV module to parse the files.


Load the mappings into a hash (dictionary), then go through your files and test each field for whether it is a key in the hash, replacing it with the value if it is. Write each line out to a temporary file, and when done move it into a new file (or overwrite the processed file). Any tool has to do that, more or less.

With Perl, tested with a few small made-up files:

use warnings;
use strict;
use feature 'say';

use File::Copy qw(move);

my $file = shift;
die "Usage: $0 mapping-file data-files\n"  if not $file or not @ARGV;

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) { 
    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;  # see Notes
    $map{$key} = $val;
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;               # remove leading/trailing whitespace
        my @fields = split /\s*,\s*/;
        exists($map{$_}) && ($_=$map{$_}) for @fields;  # see Notes
        say $fh_out join ',', @fields;
    }   
    close $fh_out;

    # Change to commented out line once thoroughly tested
    #move($outfile, $file) or die "can't move $outfile to $file: $!";
    move($outfile, 'new_'.$file) or die "can't move $outfile: $!";
}
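Run it with the mapping file first, followed by the data files, for example (the script name here is just a placeholder)

perl replace_fields.pl mapping.txt file1.txt file2.txt ...

and, as coded above, each data file is written out as new_file1.txt and so on.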

Notes.

  • The check of data against mappings is written for efficiency: we must look at each field, there's no escaping that, but then we only check the field as a hash key (no regex). For this, all leading/trailing spaces need to be stripped. Thus this code may change whitespace in the output data files; in case that matters for some reason it can of course be modified to preserve the original spaces (one way to do that is sketched right after these notes).

  • It came up in comments that a field in the data can in fact differ, by having extra quotes. In that case, extract the would-be key first

    for (@fields) {
        $_ = $map{$1}  if /"?([^"]*)/ and exists $map{$1};
    }
    

    This starts the regex engine on every check, which affects efficiency. It would help to instead clean up the input CSV data of quotes, and run the code as it is above, with no regex. This can be done by reading the files using a CSV-parsing module; see the second part at the end.

  • For Perls earlier than 5.14, replace

    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;
    

    with

    my ($key, $val) = map { s/^\s+|\s+$//g; $_ } split /\s*,\s*/;
    

    since the "non-destructive" /r modifier was only introduced in v5.14.

  • If you'd rather that your whole operation doesn't die for one bad file, replace `or die ...` with

    or do { 
        # print warning for whatever failed (warn "Can't open $file: $!";)
        # take care of filehandles and such if/as needed
        next;
    };
    

    and make sure to (perhaps log and) review output.
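To round off the first note: here is a minimal sketch of a whitespace-preserving variant of the per-line processing, assuming fields are separated by bare commas and spacing only appears around those commas. It strips a copy of each field for the hash lookup only, and splices the replacement back in between the original leading/trailing spaces.

while (<$fh>) {
    chomp;
    my @fields = split /,/, $_, -1;        # -1 keeps trailing empty fields
    for my $field (@fields) {
        # strip whitespace only for the lookup; $field itself keeps its spacing
        my ($lead, $key, $trail) = $field =~ /^(\s*)(.*?)(\s*)$/;
        $field = $lead . $map{$key} . $trail  if exists $map{$key};
    }
    say $fh_out join ',', @fields;
}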

This leaves room for some efficiency improvements, but nothing dramatic.


The data, with commas separating fields, may (or may not) be valid CSV. Since the question doesn't address this at all, and doesn't report problems, it is unlikely that any special properties of the CSV data format are used in the data files (delimiters embedded in data, protected quotes).

However, it's still a good idea to read these files using a module that honors full CSV, like Text::CSV. That also makes things easier, by taking care of extra spaces and quotes and handing us cleaned-up fields. So here's that: the same as above, but using the module to parse the files

use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);

use Text::CSV;

my $file = shift;
die "Usage: $0 mapping-file data-files\n"  if not $file or not @ARGV;

my $csv = Text::CSV->new ( { binary => 1, allow_whitespace => 1 } ) 
    or die "Cannot use CSV: " . Text::CSV->error_diag ();

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = $csv->getline($fh)) {
    $map{ $line->[0] } = $line->[1]
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp    

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (my $line = $csv->getline($fh)) {
        exists($map{$_}) && ($_=$map{$_}) for @$line;
        say $fh_out join ',', @$line;
    }
    close $fh_out;

    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}

Now we don't have to worry about spaces or surrounding quotes at all, which simplifies things a bit.

While it is difficult to reliably compare these two approaches without realistic data files, I benchmarked them for (made-up) large data files that involve "similar" processing. The code using Text::CSV for parsing runs either around the same, or (up to) 50% faster.

The constructor option allow_whitespace makes it remove extra spaces, perhaps contrary to what the name may imply, as I do by hand above. (Also see allow_loose_quotes and related options.) There is far more; see the docs. Text::CSV defaults to Text::CSV_XS, if that is installed.
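As a quick standalone illustration (separate from the scripts above) of what allow_whitespace does with stray spaces around a quoted field:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

$csv->parse(q{field1, "value1" ,field3});     # spaces around the quoted field
print join('|', $csv->fields()), "\n";        # prints: field1|value1|field3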


Ed *_*ton 5

You are doing 13,491 gsub()s on each of your 500,000 input lines - that's close to 7 billion full-line regexp search/replaces in total. So yes, that will take some time, and it's almost certainly corrupting your data in ways you just haven't noticed, as the result of one gsub() gets changed by a later gsub() and/or you get partial replacements!

I see from the comments that some of your fields can be surrounded by double quotes. If those fields cannot contain commas or newlines, and assuming you want full-string matches, then this is how to write it:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    map[$1] = $2
    map["\""$1"\""] = "\""$2"\""
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}

I tested the above with a mapping file of 13,500 entries and a 500,000-line input file with multiple matches on most lines, and it completed in about 1 second in cygwin on my underpowered laptop:

$ wc -l mapping.txt
13500 mapping.txt

$ wc -l file500k
500000 file500k

$ time awk -f tst.awk mapping.txt file500k > /dev/null
real    0m1.138s
user    0m1.109s
sys     0m0.015s

If that doesn't do exactly what you need, efficiently, then please edit your question to provide an MCVE and clearer requirements; see my comment under your question.

  • Thanks. I find hard-coding `"\""` more readable, especially in some other contexts (I like to stick to one way of doing things like this, and the same goes for `\047` vs `'` vs `quote="'"`), such as `$0 ~ "\"foo\""` vs `$0 ~ quote foo quote`, where I'm not even sure whether the latter would be parsed as `($0 ~ quote) foo quote` or `$0 ~ (quote foo quote)`, but I guess YMMV. (2 upvotes)
  • Ah, I got the code wrong, it should be `$0 ~ quote "foo" quote` (apparently I not only have a hard time reading it, I can't write it either :-)) but you get the idea. (2 upvotes)