从大量文本文件中提取的字符串的高效缓存解决方案

Question

从大量文本文件中提取的字符串的高效缓存解决方案

对于目录中的一堆文本文件（都非常小，大约 100 行），我需要构建一些字符串，然后将所有内容通过管道传输到其中，fzf以便用户可以选择一个文件。字符串本身取决于文件的前几行 (~20) 行，并使用几个非常简单的正则表达式模式构建。在连续调用之间，预计只有少数文件会发生变化。我正在寻找某种方法来做到这一点，而没有明显的延迟（对于用户而言）大约 50k 个文件。

这是我到目前为止所做的：我对此的第一个解决方案是一个简单的 shell 脚本，即：

cat $dir/**/* | $process_script | fzf

其中 $process_script 是一些 Perl 脚本，它逐行读取每个文本文件，直到它构建了所需的字符串，然后将其打印出来。已经有 1000 个文件要处理，这个脚本不再可用，因为它需要大约两秒钟，因此会给用户带来明显的延迟。所以我通过将字符串存储在一些文本文件中来实现一个穷人的缓存，然后只更新那些实际更改的行（基于文件的 mtime）。新脚本大致执行以下操作：

$find_files_with_mtime_newer_than_last_script_run | $process_script | fzf

Run Code Online (Sandbox Code Playgroud)

其中 $find_files_with_mtime_newer_than_last_script_run 运行fd（快速查找替换）并且 $process_script 是以下形式的 Perl 脚本

my $cache = slurp($cachefile); #read lines of cachefile into multiline string
my ($string,$id);

while (<>) {

      ($string, $id) = build_string($_); #open file and build string

      $cache = s/^.*$id.*\n//; #delete old string from cache

      $cache = $cache . $string; #insert updated string into cache

}

print $cache;

spew($cache, $cachefile); #write cachefile

spew(printf('%s', time),$mtimefile); #store current mtime

Run Code Online (Sandbox Code Playgroud)

在这里，slurp，spew并build_string执行评论中所写的内容。现在，这个解决方案足够快，用户不会注意到任何延迟，但我怀疑当文件数量增加时，这会再次改变。

我的问题如上所述，我正在寻找某种方法来加快此任务的速度。特别是，您能否评论以下策略是否应导致可接受的（即小于一秒）运行时间：

将纯文本缓存文件替换为 SQLite 文件（或类似的文件），将构建的字符串与相应的文件名和上次处理时间一起存储，然后将当前时间传递给脚本，直接提取所有需要更新的文件从 SQLite 不使用 find 或fd并行处理那些需要使用 gnu 并行更新的文件。

当然，我也会非常感谢不同的解决方案。

Answer 1

zdi*_*dim 5

注意 第一部分是使用缓存文件的方法，第二部分是使用的方法sqlite，然后是两者之间的比较。

当然，任何一种解决方案是否“足够快”完全取决于所有这些数字。采取的最佳方法也是如此。

对于您显示的内容-很少更改的小文件-基础知识应该足够好

use warnings;
use strict;
use feature 'say';

my $fcache = 'cache.txt';  # format: filename,epoch,processed_string

open my $fh, '<', $fcache or die "Can't open $fcache: $!";
my %cache = map { chomp; my @f = split /,/, $_, 3;  shift @f => \@f } <$fh>; #/
close $fh;

for (@ARGV) {
    my $mtime = (stat)[9];

    # Have to process the file (and update its record)
    if ( $cache{$_}->[0] < $mtime ) { 
        @{$cache{$_}} = ($mtime, proc_file($_));
    }   

    say $cache{$_}->[1];
}

# Update the cache file
open my $fh_out, '>', $fcache or die "Can't open $fcache: $!";
say $fh_out join(',', $_, @{$cache{$_}}) for keys %cache;

sub proc_file {  # token processing: join words with _
    my $content = do { local (@ARGV, $/) = $_[0]; <> };
    return join '_', split ' ', $content;
}

Run Code Online (Sandbox Code Playgroud)

笔记

由于使用了哈希，这不会保留缓存中记录的顺序，这似乎无关紧要。如果需要，那么您需要知道（记录）现有的行顺序，然后在写入之前进行排序
作为示例，选择“缓存”文件的确切结构和程序中使用的数据结构有点随意。一定要改善这一点
必须已经存在缓存文件才能使上述内容正常工作，其格式在注释中给出：filename,seconds-since-epoch,string. 如果它不存在，则添加代码以编写它
这里最大的消费者是从 50k 行文件填充复杂数据结构的行。只要文件很小并且只有少数需要处理，这应该是最耗时的部分

我想说的是，sqlite对于这样一个小问题，涉及主要会增加开销。

如果每次要处理的文件数量超过少数，那么您可能想并行尝试它——考虑到它们有多小，大部分时间都花在访问文件的开销上，而且那里可能有足够的“肘部空间”所以要从并行处理中获益。此外，通常 I/O 当然可以通过并行运行来加速，但这完全取决于具体情况。

我认为这是与比较的完美案例sqlite，因为我不确定会发生什么。

首先，我将 50,000 个小文件 ( a N b) 写入一个单独的目录 ( dir)

perl -wE'for (1..50_000) { open $fh, ">dir/f$_.txt"; say $fh "a $_ b" }'

Run Code Online (Sandbox Code Playgroud)

（通常总是使用三参数open！）这在我的旧笔记本电脑上花了 3 秒钟。

现在我们需要sqlite用这些文件构建一个缓存文件和一个 ( ) 数据库，然后更新其中的一些，然后比较使用程序sqlite和缓存文件的处理。

首先是使用sqlite.

在文件中创建并填充数据库 files.db

use warnings;
use strict;
use feature 'say';    
use DBI;

my ($dir, $db) = ('dir', 'files.db');
my $dbh = DBI->connect("DBI:SQLite:dbname=$db", '', '', { RaiseError => 1 });

my $table = 'files';
my $qry = qq( create table $table (
    fname   text     not null unique,
    mtime   integer  not null,
    string  text
); );
my $rv = $dbh->do($qry);

chdir $dir or die "Can't chdir to $dir: $!";    
my @fnames = glob "*.txt";

# My sqlite doesn't accept much past 500 rows in single insert (?)
# The "string" that each file is digested into: join words with _
my $tot_inserted = 0;
while (my @part = splice @fnames, 0, 500) {
    my @vals;
    for my $fname ( @part ) { 
        my $str = join '_', 
            split ' ', do { local (@ARGV, $/) = $fname; <> };
        push @vals, "('$fname'," . (stat $fname)[9] . ",'$str')";
    }   
    my $qry = qq(insert into $table (fname,mtime,string) values ) 
            . join ',', @vals;

    $tot_inserted += $dbh->do($qry);
}
say "Inserted $tot_inserted rows";

Run Code Online (Sandbox Code Playgroud)

这大约需要 13 秒，这是一次性费用。我一次insert500 行，因为我sqlite不会让我做更多；我不知道为什么会这样（我PostgreSQL在一个插入语句中推到了几百万行）。具有unique约束上一列得到它索引。

现在我们可以更改一些时间戳

touch dir/f[1-9]11.txt

Run Code Online (Sandbox Code Playgroud)

然后运行一个程序来更新sqlite这些更改的数据库

use warnings;
use strict;
use feature 'say';    
use DBI;    
use Cwd qw();
use Time::HiRes qw(gettimeofday tv_interval);

my $time_beg = [gettimeofday];

my ($dir, $db) = ('dir', 'files.db');
die "No database $db found\n" if not -f $db;    
my $dbh = DBI->connect("DBI:SQLite:dbname=$db", '', '', { RaiseError => 1 });

# Get all filenames with their timestamps (seconds since epoch)
my $orig_dir = Cwd::cwd;
chdir $dir or die "Can't chdir to $dir: $!";
my %file_ts = map { $_ => (stat)[9] } glob "*.txt";

# Get all records from the database and extract those with old timestamps    
my $table = 'files';
my $qry = qq(select fname,mtime,string from $table);    
my $rows = $dbh->selectall_arrayref($qry);
my @new_rows = grep { $_->[1] < $file_ts{$_->[0]} } @$rows;
say "Got ", 0+@$rows, " records, ", 0+@new_rows, " with new timestamps";

# Reprocess the updated files and update the record
foreach my $row (@new_rows) { 
    @$row[1,2] = ( $file_ts{$row->[0]}, proc_file($row->[0]) );
}

printf "Runtime so far: %.2f seconds\n", tv_interval($time_beg);  #--> 0.34

my $tot_updated = 0;
$qry = qq(update $table set mtime=?,string=? where fname=?);
my $sth = $dbh->prepare($qry);
foreach my $row (@new_rows) {
    $tot_updated += $sth->execute($sth);
}
say "Updated $tot_updated rows";

$dbh->disconnect;
printf "Runtime: %.2f seconds\n", tv_interval($time_beg);  #--> 1.54

sub proc_file {
    return join '_',
        split ' ', do { local (@ARGV, $/) = $_[0]; <> };
}

Run Code Online (Sandbox Code Playgroud)

这明确不打印。我忽略了这一点，因为有几种方法可以做到这一点，而我不确定到底需要打印什么。select在全部更新之后，我可能会为此目的运行另一个。

该程序在几次运行中平均需要大约 1.35 秒，非常一致。但是直到它update的数据库部分（很少！）更改它需要大约 0.35 秒，而且我不明白为什么update少数记录需要那么长时间。

接下来，为了进行比较，我们需要通过写入该缓存文件（此处遗漏的内容）来使用此 pos 的第一部分中的缓存文件来完成该方法。完整的程序与一开始的程序略有不同

use warnings;
use strict;
use feature 'say';    
use Cwd qw();

my ($dir, $cache) = ('dir', 'cache.txt');
if (not -f $cache) { 
    open my $fh, '>', $cache or die "Can't open $cache: $!";
    chdir $dir or die "Can't chdir to $dir: $!";
    my @fnames = glob "*.txt"; 
    for my $fname (@fnames) { 
        say $fh join ',', $fname, (stat $fname)[9],
            join '_', split ' ', do { local (@ARGV, $/) = $fname; <> };
    }
    say "Wrote cache file $cache, exiting.";
    exit;
}

open my $fh, '<', $cache or die "Can't open $cache $!";
my %fname = map { chomp; my @f = split /,/,$_,3; shift @f => \@f } <$fh>; #/

my $orig_dir = Cwd::cwd;
chdir $dir or die "Can't chdir to $dir: $!";
my @fnames = glob "*.txt";

for my $f (@fnames) {
    my $mtime = (stat $f)[9];

    # Have to process the file (and update its record)
    if ( $fname{$f}->[0] < $mtime ) { 
        @{$fname{$f}} = ($mtime, proc_file($f));
        say "Processed $f, updated with: @{$fname{$f}}";
    }   

    #say $fname{$_}->[1];  # 50k files! suppressed for feasible testing
}

# Update the cache
chdir $orig_dir  or die "Can't chdir to $orig_dir: $!";
open my $fh_out, '>', $cache or die "Can't open $cache: $!";
say $fh_out join(',', $_, @{$fname{$_}}) for keys %fname;


sub proc_file {
    return join '_', 
        split ' ', do { local (@ARGV, $/) = $_[0]; <> };
}

Run Code Online (Sandbox Code Playgroud)

写入缓存最初需要大约 1 秒。touch像在sqlite测试中一样对几个文件进行-ed后，该程序的下一次运行仍然相当一致，大约需要 0.45 秒。

通过这些测试，我必须得出结论，该sqlite方法在这些条件下有点慢。然而，它肯定更具可扩展性，而项目只会增加规模。还记得update数据库的需要相当多（相对），这让我感到惊讶；我的代码可能有问题，可能会加快速度。

归档时间：	5 年，6 月前
查看次数：	171 次
最近记录：	5 年，6 月前