Perl and parsing messy text

sno*_*kin 1 perl parsing unpack

I have the following text:

                         Instructor First                          Number Students Who   Number Students Who
Subject Course Section                      Instructor Last Name                                               A    B C       D F
                         Name                                      Completed the Class   Dropped the Class
ACCT    201    01        Karin              Hatheway Dial          56                    6                     19   9    16   2   5
ACCT    202    01        Karin              Hatheway Dial          69                    11                    37   14   7    2   6
ACCT    205    01        Darryl             Woolley                20                    1                     3    7    6    1   3
ACCT    205    02        Darryl             Woolley                28                    1                     6    7    13       2
ACCT    205    03        Darryl             Woolley                42                    5                     4    13   21   1   3
ACCT    205    04        Darryl             Woolley                23                    1                     9    5    8    1
ACCT    205    05        Darryl             Woolley                30                    2                     11   7    9    2   1
ACCT    205    06        Darryl             Woolley                25                    3                     8    9    6    1   1
ACCT    275    01        Darryl             Woolley                33                    2                     7    15   9    1   1
ACCT    310    01        Marla              Kraut                  16                    1                     1    6    7    2
ACCT    310    02        Marla              Kraut                  64                                          5    43   15   1
ACCT    310    03        Marla              Kraut                  72                    3                     11   47   10   3   1
ACCT    311    01        Karin              Hatheway Dial          45                                          13   20   11   1
ACCT    311    02        Karin              Hatheway Dial          25                                          10   12   3
ACCT    315    01        Jason              Porter                 26                                          6    5    8    6   1
ACCT    315    02        Jason              Porter                 29                    1                     6    10   5    7   1
ACCT    414    01        Teresa             Gordon                 22                    1                     6    6    9    1
ACCT    483    01        Glen               Utzman                 26                    1                     7    13   6
ACCT    486    01        Teresa             Gordon                 33                                          13   14   6
ACCT    492    01        Jason              Wills                  23                                          5    8    9    1
ACCT    515    01        Jeffrey            Harkins                15                                          7    6    1
ACCT    561    01        Jason              Porter                 18                    1                     10   7    1
ADOL    526    13        Charles            Gagel                  21                    2                     19   1             1
ADOL    573    13        Martha             Yopp                   28                                          16   3             1
ADOL    574    01        Laura              Holyoke                16                                          12   3             1
ADOL    574    11        Laura              Holyoke                9                     1                     8    1
ADOL    574    13        Laura              Holyoke                15                                          10   4             1
ADOL    600    13        Roger              Scott                  19                                          4         1
AERO    101    01        William            Beauter                11                                          8    2    1
AERO    103    01        Sarah              Babbitt                15                                          7    6    1        1
AERO    411    01        Sarah              Babbitt                11                                          6    4    1
AERO    413    01        Sarah              Babbitt                12                                          8    3    1
AGEC   101   01   Larry         Van Tassell   36    1    20   15        1
AGEC   278   01   Larry         Makus         21    1    2    6    8    5
AGEC   278   02   Larry         Makus         18         5    10   2    1
AGEC   278   03   Larry         Makus         17    1    2    7    5    2    1
AGEC   301   01   Christopher   McIntosh      18         9    4    5
AGEC   356   01   Joseph        Guenthner     23         15   6    2
AGEC   361   01   Ruby          Stroschein    11         4    1    6
AGEC   411   01   Robert        Haggerty      11         6    4    1
AGEC   413   01   Robert        Spear         12    3    4    5    2    1
AGEC   415   01   Larry         Van Tassell   11         10   1
AGEC   526   01   Scott         Matulich      7          2    5
AGEC   527   01   Stephen       Cooke         5          3    2
AGED   180   01   Lori          Moore         23    1    14   5    1    3
AGED   351   01   Lou           Riesenberg    11         4    6    1
AMST   301   01   Walter        Hesford       26         14   8    3         1
ANTH   100   01   Mark          Warner        104   15   31   31   21   8    12
ANTH   220   01   Fumiyasu      Arakawa       138   4    48   53   19   10   8
ANTH   230   01   Robert        Sappington    28    1    7    9    9    2    1
ANTH   251   01   Donald        Tyler         36    1    10   14   8    1    3
ANTH   420   01   Laura         Putsche       12         3    4    2         2
ANTH   422   01   Rodney        Frey          13         11                  2
ANTH   427   02   Virginia      Babcock       13    1         2    6 4       1
ANTH   462   01   Laura         Putsche       33    3    8    20   3 1
ARBC   101   01   Anisah        El-Mansouri   14    1    8    5    1
ARCH   151   01   Randall       Teal          150   8    72   40   13 6      19
ARCH   253   01   Roman         Montoto       23    1    9    10   2         1
ARCH   253   02   Randall       Teal          22    2    9    11   2
ARCH   253   03   Xiao          Hu            23    2    11   12
ARCH   353   01   Matthew       Brehm         16         7    7    1
ARCH   353   02   Dillon        Ellefson      16         4    11   1
ARCH   353   03   Xiao          Hu            10         4    6
ARCH   385   01   Anne          Marshall      68    5    29   22   11 2      4
ARCH   404   04   Matthew       Brehm         10         1    5    3 1
ARCH   453   01   Roman         Montoto       10         5    4    1
ARCH   453   02   Anne       Marshall              13        6     5             1
ARCH   463   01   Phillip    Mead                  63    1   26    31   5 1
ARCH   465   01   Kenneth    Carper                51    1   8     26   12 3
ARCH   483   01   D.         Reese                 71    2   27    35   8
ARCH   504   02   Randall    Teal                  15        9     6
ARCH   504   03   Kevin      Van Den Wymelenberg   6         3     1             1
ARCH   504   04   Frank      Jacobus               12    1   8     4
ARCH   510   02   D.         Reese                 13        9     4
ARCH   510   04   Robert     Thornton              9         7     1
ARCH   510   05   Roman      Montoto               11    2   7     4
ARCH   553   01   Bruce      Haglund               14        12    2

I have this code (a sub) that takes each line and is supposed to produce a list of the relevant fields:

sub GetData {

    my $non_nor_line              = shift;
    my( $subj, $crs,$sec, $rest ) = unpack "a6 a6 a6 a*", $non_nor_line;
    my $name                      = undef;
    my $upk_short  = q{A3A2A3A2 A3A2 A3AA5 A6};

    $rest =~ m/(.+?)\d/;
    $name = $1;
    $rest =~ s/$1//;
    $rest =~ s/^\s+//;
    $rest =~ s/\s+$//;
    my @rest_data                 = unpack($upk_short,$rest);

    print $_ ."\n" foreach(@rest_data);
}

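I can't figure out how to get the data out of $rest. I have tried many variations of unpack, to no avail, and I need to store the result in a list. Ignore 'upk_short'; it is not correct. I tried plenty of other templates too, but the lines seem too dynamic.

Update: if anyone can find a way to normalize the text, that would be fine as well. By that I mean aligning everything so that I can parse it Tom's way.

Any ideas?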

tch*_*ist 5

#!/usr/bin/env perl

use strict;
use warnings;

sub cut2fmt {
    my @positions  = @_;
    my $template   = "";
    my $lastpos    = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);

my @keys = qw{

    subject                 course              section

    instructor_first_name   instructor_last_name

    completed_the_class     dropped_the_class

    grade_A                 grade_B
    grade_C                 grade_D
    grade_F

};

our @All_Records;

while (<DATA>) {
    next if 1 .. /^\s*\|/;
    my %rec;
    @rec{@keys} = unpack($fmt, $_);
    for my $key (grep { /^grade_[A-F]$/ } @keys) {
        $rec{$key} ||= 0;
    }
    push @All_Records, \%rec;
}

for my $rec (@All_Records) {
    for my $key (@keys) {
        print "$key: $rec->{$key}\n";
    }
    print "\n";

}

__END__
Subject Course Section                      Instructor Last Name                                               A    B C       D F
                         Name                                      Completed the Class   Dropped the Class
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
         1         2         3         4         5         6         7         8         9         0         1         2         3         4
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
        |      |         |                  |                      |                     |                     |    |    |    |   |
ACCT    201    01        Karin              Hatheway Dial          56                    6                     19   9    16   2   5
ACCT    202    01        Karin              Hatheway Dial          69                    11                    37   14   7    2   6
ACCT    205    01        Darryl             Woolley                20                    1                     3    7    6    1   3
ACCT    205    02        Darryl             Woolley                28                    1                     6    7    13       2
ACCT    205    03        Darryl             Woolley                42                    5                     4    13   21   1   3
ACCT    205    04        Darryl             Woolley                23                    1                     9    5    8    1
ACCT    205    05        Darryl             Woolley                30                    2                     11   7    9    2   1
ACCT    205    06        Darryl             Woolley                25                    3                     8    9    6    1   1
ACCT    275    01        Darryl             Woolley                33                    2                     7    15   9    1   1
ACCT    310    01        Marla              Kraut                  16                    1                     1    6    7    2
ACCT    310    02        Marla              Kraut                  64                                          5    43   15   1
ACCT    310    03        Marla              Kraut                  72                    3                     11   47   10   3   1
ACCT    311    01        Karin              Hatheway Dial          45                                          13   20   11   1
ACCT    311    02        Karin              Hatheway Dial          25                                          10   12   3
ACCT    315    01        Jason              Porter                 26                                          6    5    8    6   1
ACCT    315    02        Jason              Porter                 29                    1                     6    10   5    7   1
ACCT    414    01        Teresa             Gordon                 22                    1                     6    6    9    1
ACCT    483    01        Glen               Utzman                 26                    1                     7    13   6
ACCT    486    01        Teresa             Gordon                 33                                          13   14   6
ACCT    492    01        Jason              Wills                  23                                          5    8    9    1
ACCT    515    01        Jeffrey            Harkins                15                                          7    6    1
ACCT    561    01        Jason              Porter                 18                    1                     10   7    1
ADOL    526    13        Charles            Gagel                  21                    2                     19   1             1
ADOL    573    13        Martha             Yopp                   28                                          16   3             1
ADOL    574    01        Laura              Holyoke                16                                          12   3             1
ADOL    574    11        Laura              Holyoke                9                     1                     8    1
ADOL    574    13        Laura              Holyoke                15                                          10   4             1
ADOL    600    13        Roger              Scott                  19                                          4         1
AERO    101    01        William            Beauter                11                                          8    2    1
AERO    103    01        Sarah              Babbitt                15                                          7    6    1        1
AERO    411    01        Sarah              Babbitt                11                                          6    4    1
AERO    413    01        Sarah              Babbitt                12                                          8    3    1

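The first thing you have to do is normalize your data. Your columns are inconsistent, and I can't tell you why that is. Perhaps you have tabs that need to be piped through expand -8 or something. I have included only the data that is all aligned the same way.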
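If tabs do turn out to be the culprit (that is only a guess, since the text as posted shows no tabs), the expansion can also be done inside Perl instead of piping through expand. Here is a minimal sketch using the core Text::Tabs module; the two sample records are made up for illustration and are not taken from the real file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

$tabstop = 8;      # same tab width as `expand -8`

# Hypothetical sample records whose columns are separated by tabs.
my @raw = ("ACCT\t201\t01\tKarin", "AGEC\t101\t01\tLarry");

# expand() replaces each tab with enough spaces to reach the next tab stop,
# so every field starts at a fixed column afterwards.
my @aligned = expand(@raw);

print "$_\n" for @aligned;
```

Once every line uses spaces only, one ruler and one unpack template can cover the whole file, provided the columns really do line up after expansion.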

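To get your unpack format right every time, all you have to do is draw a numbered ruler like the one I put under the column headers in the data section. Place a | mark at the position where each field starts. Record that column number, and pass it to the included cut2fmt() function. It turns those numbers into a pack/unpack template.

That's all there is to it.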
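To make the ruler-to-template step concrete, here is a small sketch that prints the template cut2fmt() builds from the same column numbers and then splits one record with it. The sample line is rebuilt with pack from the values of the ACCT 205 section 02 row (which has no D grade) rather than pasted in, purely so that the spacing in the example cannot drift. That is an illustration choice, not part of the original program.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same helper as in the program above: turn 1-based field start columns
# into an unpack template of fixed-width "A" fields.
sub cut2fmt {
    my @positions = @_;
    my $template  = "";
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);
print "$fmt\n";

# pack with "A" pads each value with spaces to the field width,
# which reproduces the fixed-column layout of the report.
my $line = pack($fmt, qw(ACCT 205 02 Darryl Woolley 28 1 6 7 13), "", "2");

# unpack with "A" strips the trailing padding again; the empty D column
# comes back as an empty string.
my @fields = unpack($fmt, $line);
print join("|", @fields), "\n";
```

This prints the template `A8 A7 A10 A19 A23 A22 A22 A5 A5 A5 A4 A*` followed by `ACCT|205|02|Darryl|Woolley|28|1|6|7|13||2`. Each width is simply the difference between two neighbouring | positions on the ruler, and the final `A*` takes whatever is left on the line.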

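I would tell you where these nuggets come from, but I just hate pushy self-promoters, so it would be hypocritical of me to stoop that low. I won't do it. If someone wants to advertise, well, let them buy ad space from the site. Those of us who hate junk ads can then block them with our ad blockers. Anything else just doesn't sit right with me.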