Is it possible to parse this nightmare with Perl?

Che*_*eso 1 regex perl parsing

I'm working with some doc files that, when copied and pasted into a text file, give me the following sample 'output':

ARTA215   ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required

ARTA220   CERAMICS II  (3 Cr) (2:2)  + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required

ARTA250   SPECIAL TOPICS IN ART 
  This course focuses on selected topic....

ARTA260   PORTFOLIO DEVELOPMENT   (3 Cr) (3:0)
The purpose of this course is to pre....
BIOS010   INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2) 
This course is a preparatory course designed to familiarize the begi....

BIOS101   GENERAL BIOLOGY (4 Cr) (3:3)
This course introduces the student to the principles of mo...
Lab Fee Required

BIOS102   INTRODUCTION TO HUMAN BIOLOGY  (4 Cr)  (3:3)
This course is an introd....
Lab Fee Required

I want to be able to parse this so that 3 fields are generated and I can output the values to a .csv file.

The line breaks, spacing, and so on are shown as they could appear at any point in this file.

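My best guess is a regex that finds 4 capital alpha characters followed by 3 numeric characters, then checks whether the next 2 characters are capitals. (This accounts for the course number, but also rules out tripping on a line such as "Prerequisite: ARTA150" in the first entry.) After that, the regex finds the first line break and grabs everything after it until it reaches the next course number. The 3 fields would be the course number, the course title, and the course description. The course number and title are always on the same line, and the description is everything beneath.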
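The header test described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_header` is a hypothetical helper name, the course number is taken to be exactly 4 capitals plus 3 digits, and the "next 2 characters" are checked after skipping the whitespace that separates the number from the title.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns
# (course_number, title) when the line looks like a course header,
# and an empty list otherwise.
sub parse_header {
    my ($line) = @_;
    # 4 capitals, 3 digits, whitespace, then a title that starts with
    # 2 capitals. A line like "Prerequisite: ARTA150" fails this test.
    if ($line =~ /^\s*([A-Z]{4}\d{3})\s+([A-Z]{2}.*?)\s*$/) {
        return ($1, $2);
    }
    return;
}

for my $line ("ARTA215   ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.",
              "Prerequisite: ARTA150",
              "Lab Fee Required") {
    my ($number, $title) = parse_header($line);
    print defined $number ? "header: $number | $title\n" : "text:   $line\n";
}
```

Only the first of the three sample lines is treated as a header; the other two would be appended to the description.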

A sample end result would contain 3 fields, which I'm guessing could be stored in 3 arrays:

"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
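One way to produce a line in that shape is sketched below. `csv_line` is a hypothetical helper, and doubling embedded quotes is an assumption about the CSV dialect wanted; the sample data contains no quotes, so that step only matters for other input.

```perl
use strict;
use warnings;

# Hypothetical helper: normalizes whitespace in each field, doubles any
# embedded double quotes, and wraps every field in double quotes.
sub csv_line {
    my @fields = @_;              # work on copies, not the caller's values
    for (@fields) {
        s/\s+/ /g;                # collapse whitespace runs, newlines included
        s/^ //; s/ $//;           # trim the single leading or trailing space
        s/"/""/g;                 # assumed quoting rule: double embedded quotes
    }
    return join(",", map { qq{"$_"} } @fields);
}

print csv_line(
    "ARTA215",
    "ADVANCED LIFE DRAWING (3 Cr) (2:2)  + Studio 1 hr.",
    "This advanced study in drawing with the life ....\nPrerequisite: ARTA150\nLab Fee Required\n",
), "\n";
```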


Like I said, it's a nightmare, but I'd like to automate this rather than clean up by hand every time the files are generated.

Gre*_*con 11

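Consider the following example, which relies on course-description blocks being entirely contained within what Perl considers to be paragraphs: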

#! /usr/bin/perl

use strict;
use warnings;

$/ = "";  # paragraph mode: records are separated by one or more blank lines
my $record_start = qr/
  ^            # anchor at the start of a line
  \s*          # allow optional leading whitespace
  ([A-Z]+\d+)  # capture course tag, e.g., ARTA215
  \s+          # separating whitespace
  (.+?)        # course title on rest of line
  \s*\n        # consume trailing whitespace
/mx;

while (<>) {
  my($course,$title);
  if (s/\A$record_start//) {  # fix Stack Overflow highlighting /
    ($course,$title) = ($1,$2);
  }
  elsif (s/(?s:^.+?)(?=$record_start)//) {  # ditto /
    redo;
  }
  else {
    next;
  }

  die unless s/^(.+?)(?=$record_start|\s*$)//s;
  (my $desc = $1) =~ s/\s*\n\s*/ /g;
  for ($course, $title, $desc) {
    s/^\s+//; s/\s+$//; s/\s+/ /g;
  }
  print join("," => map qq{"$_"} => $course, $title, $desc), "\n";
  redo if $_;
}

Given the sample input, it outputs

"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"

  • Assuming the sample input is literally correct, you can't use paragraph mode... you miss BIOS010. (2 upvotes)
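To illustrate the point about paragraph mode: with `$/ = ""` a record ends only at a blank line, so two courses with no blank line between them arrive in the same record. The sketch below reads a fragment of the sample from an in-memory filehandle; `paragraphs` is a hypothetical helper and the tag pattern (4 capitals, 3 digits) is an assumption based on the sample.

```perl
use strict;
use warnings;

# A fragment of the sample input: there is no blank line between the
# ARTA260 description and the BIOS010 header.
my $sample = join "\n",
    "ARTA250   SPECIAL TOPICS IN ART ",
    "  This course focuses on selected topic....",
    "",
    "ARTA260   PORTFOLIO DEVELOPMENT   (3 Cr) (3:0)",
    "The purpose of this course is to pre....",
    "BIOS010   INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2) ",
    "This course is a preparatory course designed to familiarize the begi....",
    "";

# Hypothetical helper: splits text into records using paragraph mode.
sub paragraphs {
    my ($text) = @_;
    local $/ = "";                                  # paragraph mode
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = paragraphs($sample);
for my $i (0 .. $#records) {
    # Collect every course tag that starts a line inside this record.
    my @tags = $records[$i] =~ /^\s*([A-Z]{4}\d{3})\b/mg;
    print "record $i: @tags\n";
}
```

The second record carries both ARTA260 and BIOS010, so any paragraph-mode parser has to keep scanning inside a record for further course headers, or it will fold BIOS010 into the previous description.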

yst*_*sth 7

Try:

my $course;
my @courses;
while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        $course = [ "$1", "$2" ];
        push @courses, $course;
    }
    elsif ($course) {
        $course->[2] .= $line
    }
    else {
        # garbage before first course in file
        next
    }
}

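This produces an array of arrays, which as far as I can tell is what you want. An array of hashes or even a hash of hashes would make more sense to me.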
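A minimal sketch of the array-of-hashes variant, using the same line-by-line approach. The helper name `parse_courses` and the keys `number`, `title`, and `description` are illustrative choices, not something from the question, and the lines are passed in as a list instead of being read from a filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the file and returns a list of
# hash refs, one per course.
sub parse_courses {
    my @lines = @_;
    my (@courses, $course);
    for my $line (@lines) {
        if ($line =~ /^\s*([A-Z]{4}\d+)\s+([A-Z]{2}.*?)\s*$/) {
            # A header line starts a new course record.
            $course = { number => $1, title => $2, description => '' };
            push @courses, $course;
        }
        elsif ($course) {
            # Anything else belongs to the current course's description.
            $course->{description} .= $line;
        }
        # Lines before the first header are skipped.
    }
    return @courses;
}

my @courses = parse_courses(
    "ARTA260   PORTFOLIO DEVELOPMENT   (3 Cr) (3:0)\n",
    "The purpose of this course is to pre....\n",
    "BIOS010   INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2) \n",
    "This course is a preparatory course designed to familiarize the begi....\n",
);

for my $c (@courses) {
    (my $desc = $c->{description}) =~ s/\s+/ /g;   # flatten to one line
    $desc =~ s/^ | $//g;                           # trim
    print join(",", map { qq{"$_"} } $c->{number}, $c->{title}, $desc), "\n";
}
```

Named keys make the output step easier to read than positional indexes such as `$course->[2]`, at the cost of slightly more typing.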

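  • Of course, we can always "fix" a perfectly understandable code snippet to produce the desired output by adding something like this at the end: print join "\n", map { join ',', map { s/(\r\n|\n)$//gs; qq{"$_"} } @$_ } @courses; (2 upvotes)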