Delete specific lines from a 12GB file

mr_*_*air 9 unix bash perl file

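I am trying to delete specific lines from a 12GB text file.

sed on HP-UX has no -i option, and alternatives such as writing to a temporary file won't work because I only have 20GB of space available and the text file already uses 12GB of it.

Given the space constraint, I am trying to do this with Perl.

This solution works for deleting the last 9 lines from the 12GB file.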

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::File;

tie my @lines, 'Tie::File', 'test.txt' or die "$!\n";
$#lines -= 9;
untie @lines;

I would like to modify the code above so that it deletes any specific line number.

ike*_*ami 12

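Tie::File is never the answer.

  • It's insanely slow.
  • Even if you limit the size of its buffer, it can consume more memory than simply slurping the entire file into memory.

You are hitting both of these problems. Your code touches every line of the file (`$#lines` needs the line count), so Tie::File reads the entire file and stores the index of every line in memory. That takes 28 bytes per line on a 64-bit build of Perl (not counting any overhead in the memory allocator).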
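To put the 28 bytes per line in perspective, here is a rough back-of-the-envelope sketch. The average line length of 100 bytes is purely an assumption for illustration (the question does not state it), and `index_bytes` is a hypothetical helper, not part of Tie::File.

```perl
use strict;
use warnings;

# Per-line index cost quoted above for a 64-bit build of Perl.
use constant BYTES_PER_LINE => 28;

# Hypothetical helper: rough size of the line index for a file of
# $file_size bytes whose lines average $avg_line_len bytes
# (line terminator included).
sub index_bytes {
    my ($file_size, $avg_line_len) = @_;
    my $lines = int($file_size / $avg_line_len);
    return $lines * BYTES_PER_LINE;
}

my $file_size = 12 * 1024**3;   # 12 GiB, as in the question
my $avg_len   = 100;            # ASSUMPTION: not given in the question

printf "about %.1f GiB of index for %d lines\n",
    index_bytes($file_size, $avg_len) / 1024**3,
    int($file_size / $avg_len);
```

Under that assumption the index alone comes to a little over 3 GiB, before any allocator overhead. Shorter lines would make it proportionally larger.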


To delete the last 9 lines of a file, you can use the following:

use File::ReadBackwards qw( );

my $qfn = '...';

my $pos;
{
   my $bw = File::ReadBackwards->new($qfn)
      or die("Can't open \"$qfn\": $!\n");

   for (1..9) {
      defined( my $line = $bw->readline() )
         or last;
   }

   $pos = $bw->tell();
}

# Can't use $bw->get_handle because it's a read-only handle.
truncate($qfn, $pos)
   or die("Can't truncate \"$qfn\": $!\n");
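File::ReadBackwards is a CPAN module rather than part of core Perl. If installing it on the HP-UX machine is not an option, the same truncation offset can be found with core functions only. The following is a minimal sketch of that idea, not the module's implementation: `offset_before_last_lines` and `BLOCK_SIZE` are hypothetical names, and it assumes lines are terminated by "\n".

```perl
use strict;
use warnings;
use Fcntl qw( SEEK_SET );

use constant BLOCK_SIZE => 64*1024;

# Hypothetical helper: byte offset at which to truncate $qfn so that its
# last $n lines disappear. Returns 0 if the file has $n lines or fewer.
sub offset_before_last_lines {
    my ($qfn, $n) = @_;

    open(my $fh, '<:raw', $qfn)
        or die("Can't open \"$qfn\": $!\n");

    my $size = -s $fh;
    return $size if !$size || $n <= 0;

    my $pos  = $size;   # Start of the block currently being scanned.
    my $seen = 0;       # Line breaks found so far, counting from the end.
    my $need;           # Set once we know whether the file ends with "\n".

    while ($pos > 0) {
        my $len = $pos < BLOCK_SIZE ? $pos : BLOCK_SIZE;
        $pos -= $len;

        sysseek($fh, $pos, SEEK_SET)
            or die($!);
        my $buf = '';
        while (length($buf) < $len) {
            my $rv = sysread($fh, $buf, $len - length($buf), length($buf));
            die($!) if !defined($rv);
            die("Unexpected end of file\n") if !$rv;
        }

        # A final "\n" terminates the last line rather than starting a new
        # one, so one extra line break has to be skipped in that case.
        $need = $n + ( substr($buf, -1) eq "\n" ? 1 : 0 )
            if !defined($need);

        # Walk the block backwards, one line break at a time.
        my $i = $len;
        while ( ( $i = rindex($buf, "\n", $i - 1) ) >= 0 ) {
            return $pos + $i + 1 if ++$seen == $need;
            last if $i == 0;
        }
    }

    return 0;
}
```

The result can be passed straight to `truncate($qfn, offset_before_last_lines($qfn, 9))`, with the same error check as above. Only the tail of the file is read, so the cost does not depend on the 12GB size.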

To delete an arbitrary line, you can use the following:

my $qfn = '...';

open(my $fh_src, '<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");    
open(my $fh_dst, '+<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");

while (<$fh_src>) {
   next if $. == 9;  # Or "if /keyword/", or whatever condition you want.

   print($fh_dst $_)
      or die($!);
}

truncate($fh_dst, tell($fh_dst))
   or die($!);    
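Since the question asks about deleting specific line numbers, the condition can be driven by a lookup table instead of a single hard-coded number. This is a sketch of the same two-handle approach wrapped in a hypothetical `delete_lines` sub; the name and the calling convention are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the given 1-based line numbers from $qfn
# in place, without creating a temporary file.
sub delete_lines {
    my ($qfn, @line_nums) = @_;

    # Hash lookup keeps the per-line check cheap.
    my %drop = map { $_ => 1 } @line_nums;

    open(my $fh_src, '<:raw', $qfn)
        or die("Can't open \"$qfn\": $!\n");
    open(my $fh_dst, '+<:raw', $qfn)
        or die("Can't open \"$qfn\": $!\n");

    # The write position never overtakes the read position, so the
    # unread part of the file is never overwritten.
    while (my $line = <$fh_src>) {
        next if $drop{$.};

        print($fh_dst $line)
            or die($!);
    }

    # Cut off the leftover tail that the kept lines no longer cover.
    truncate($fh_dst, tell($fh_dst))
        or die($!);
    close($fh_dst)
        or die($!);
}
```

For example, `delete_lines('test.txt', 9, 120, 4500)` would drop those three lines in a single pass. As with the code above, the file is left in an inconsistent state if the process is interrupted midway, so it may be worth testing on a small copy first.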

The following optimized version assumes there is only one line (or one block of lines) to delete:

use Fcntl qw( SEEK_CUR SEEK_SET );

use constant BLOCK_SIZE => 4*1024*1024;

my $qfn = 'file';

open(my $fh_src, '<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");
open(my $fh_dst, '+<:raw', $qfn)
   or die("Can't open \"$qfn\": $!\n");

my $dst_pos;
while (1) {
   $dst_pos = tell($fh_src);
   defined( my $line = <$fh_src> )
      or do {
         $dst_pos = undef;
         last;
      };

   last if $. == 9;  # Or "if /keyword/", or whatever condition you want.
}

if (defined($dst_pos)) {
   # We're switching from buffered I/O to unbuffered I/O,
   # so we need to move the system file pointer from where the
   # buffered read left off to where we actually finished reading.
   sysseek($fh_src, tell($fh_src), SEEK_SET)
      or die($!);

   sysseek($fh_dst, $dst_pos, SEEK_SET)
      or die($!);

   while (1) {
      my $rv = sysread($fh_src, my $buf, BLOCK_SIZE);
      die($!) if !defined($rv);
      last if !$rv;

      my $written = 0;
      while ($written < length($buf)) {
         my $rv = syswrite($fh_dst, $buf, length($buf)-$written, $written);
         die($!) if !defined($rv);
         $written += $rv;
      }
   }

   # Must use sysseek instead of tell with sysread/syswrite.    
   truncate($fh_dst, sysseek($fh_dst, 0, SEEK_CUR))
      or die($!);
}

  • "*You need the number of lines in the file...*" Damn, you're right. Even if you avoid it in the `for` loop, `splice` will call `FETCHSIZE`.