Parsing a huge text file in Perl

Question

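I have tab-separated text files that can reach 1 GB. The number of columns varies with the number of samples in the file. Each sample has eight columns. For example, sample A: ID1, ID2, MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A, where ID1 and ID2 are common to all samples. What I want to do is split the whole file into several smaller files, one per sample.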

ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487


This is what my sample file looks like, and I would like to split it into:

File A:
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853

File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948

File C:

ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487


Is there an easier way to do this than going through an array?

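The logic I have worked out is to take (number of header columns - 2) and divide it by 8, which gives the number of samples in the file. Then I would go through each element of the array and parse it. That seems like a tedious way of doing it. I would be glad to hear of any simpler way to handle this.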
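The counting rule described above can be sketched in a few lines of Perl. This is only an illustration: the helper name `sample_count` is hypothetical, and it assumes a comma-separated header like the one in the sample data, with the two shared ID columns first.

```perl
use strict;
use warnings;

# Hypothetical helper: number of 8-column samples implied by a header line.
# Assumes comma-separated fields whose first two entries are the shared IDs.
sub sample_count {
    my ($header) = @_;
    chomp $header;
    my @cols       = split /\s*,\s*/, $header;   # tolerate spaces after commas
    my $per_sample = 8;
    my $extra      = @cols - 2;                  # columns left after ID1, ID2

    die "Header has $extra sample columns, not a multiple of $per_sample\n"
        if $extra < 0 || $extra % $per_sample;

    return $extra / $per_sample;
}

my $header = 'ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,'
           . 'MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,'
           . 'MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C';

print sample_count($header), "\n";   # 26 columns: (26 - 2) / 8 = 3
```

Only the header line has to be examined for this, so the check costs nothing extra on a 1 GB file.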

Thanks, Sipra

Answer 1

Dav*_*oss 8

#!/usr/bin/env perl

use strict;
use warnings;

# open three output filehandles
my %fh;
for (qw[A B C]) {
  open $fh{$_}, '>', "file$_" or die $!;
}

# open input
open my $in, '<', 'somefile' or die $!;

# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;

while (<$in>) {
  chomp;
  my @data = split /,/;

  print $fh{A} join(',', @data[0 .. 9]), "\n";
  print $fh{B} join(',', @data[0, 1, 10 .. 17]), "\n";
  print $fh{C} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
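The comment in the script above leaves open how the header line might be parsed. One possible approach, shown here only as a sketch, is to read the sample labels from the column-name suffixes instead of hard-coding `A B C`. The helper name `sample_labels` is hypothetical, and the code assumes every sample column is named `<FIELD>_<LABEL>`, as in the question's header.

```perl
use strict;
use warnings;

# Hypothetical helper: list the sample labels (A, B, C, ...) in header order.
# Assumes each sample column is named <FIELD>_<LABEL>, e.g. MIN_A or AR5_C.
sub sample_labels {
    my ($header) = @_;
    chomp $header;
    my @cols = split /\s*,\s*/, $header;

    my (%seen, @labels);
    for my $col (@cols[2 .. $#cols]) {          # skip the shared ID1, ID2
        my ($label) = $col =~ /_([^_]+)\z/
            or die "Column '$col' has no _<LABEL> suffix\n";
        push @labels, $label unless $seen{$label}++;   # keep first occurrence
    }
    return @labels;
}

print join(' ', sample_labels('ID1,ID2,MIN_A,AVG_A,MIN_B, AVG_B')), "\n";   # A B
```

With something like this, the loop that opens the output filehandles could iterate over `sample_labels($header)` rather than over a fixed list, provided the header line is kept in a variable instead of being discarded.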


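Update: I got bored and made it cleverer, so it now automatically handles any number of 8-column records in the file. Unfortunately, I don't have time to explain it or add comments.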

#!/usr/bin/env perl

use strict;
use warnings;

# open input
open my $in, '<', 'somefile' or die $!;

chomp(my $head = <$in>);
my @cols = split/,/, $head;

die 'Invalid number of records - ' . @cols . "\n"
  if (@cols -2) % 8;

my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
   my %desc;
   $desc{start_col} = (($_ - 1) * 8) + 2;
   $desc{end_col}   = $desc{start_col} + 7;
   open $desc{fh}, '>', 'file' . $name++ or die $!;
   print {$desc{fh}} join(',', @cols[0,1],
                               @cols[$desc{start_col} .. $desc{end_col}]),
                     "\n";

   push @files, \%desc;
}

while (<$in>) {
  chomp;
  my @data = split /,/;

  foreach my $f (@files) {
    print {$f->{fh}} join(',', @data[0,1],
                               @data[$f->{start_col} .. $f->{end_col}]),
                   "\n";
   }
}
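To make the column arithmetic in the script above easier to follow, here is a small standalone sketch that prints the slice each output file receives. The helper name `sample_ranges` is hypothetical; the formulas for the start and end columns are the ones the script uses.

```perl
use strict;
use warnings;

# Hypothetical helper: for a total column count, return one
# [start_col, end_col] pair per 8-column sample (0-based indices).
sub sample_ranges {
    my ($ncols) = @_;
    die "Invalid number of columns - $ncols\n"
        if $ncols < 2 || ($ncols - 2) % 8;

    return map { my $start = ($_ - 1) * 8 + 2; [ $start, $start + 7 ] }
           1 .. ($ncols - 2) / 8;
}

my $name = 'A';
for my $range (sample_ranges(26)) {
    # $name++ uses Perl's string increment: A, B, C, ...
    printf "file%s: columns 0,1,%d..%d\n", $name++, @$range;
}
# fileA: columns 0,1,2..9
# fileB: columns 0,1,10..17
# fileC: columns 0,1,18..25
```

For the 26-column sample file these are the same slices that the first, hard-coded version of the script writes to files A, B and C.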

