How to extract a paragraph and selected lines with Perl?

Question

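I have a text, and I need to:

Extract the whole paragraph in the "AceView summary" section, up to (but not including) the line that starts with "Please quote".
Extract the line that starts with "The closest human gene".
Store them in an array with two elements.

The text looks like this (also on pastebin):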

  AceView: gene:1700049G17Rik, a comprehensive annotation of human, mouse and worm genes with mRNAs or ESTsAceView.

  <META NAME="title"
 CONTENT="
AceView: gene:1700049G17Rik a comprehensive annotation of human, mouse and worm genes with mRNAs or EST">

<META NAME="keywords"
 CONTENT="
AceView, genes, Acembly, AceDB, Homo sapiens, Human,
 nematode, Worm, Caenorhabditis elegans , WormGenes, WormBase, mouse,
 mammal, Arabidopsis, gene, alternative splicing variant, structure,
 sequence, DNA, EST, mRNA, cDNA clone, transcript, transcription, genome,
 transcriptome, proteome, peptide, GenBank accession, dbest, RefSeq,
 LocusLink, non-coding, coding, exon, intron, boundary, exon-intron
 junction, donor, acceptor, 3'UTR, 5'UTR, uORF, poly A, poly-A site,
 molecular function, protein annotation, isoform, gene family, Pfam,
 motif ,Blast, Psort, GO, taxonomy, homolog, cellular compartment,
 disease, illness, phenotype, RNA interference, RNAi, knock out mutant
 expression, regulation, protein interaction, genetic, map, antisense,
 trans-splicing, operon, chromosome, domain, selenocysteine, Start, Met,
 Stop, U12, RNA editing, bibliography">
<META NAME="Description" 
 CONTENT= "
AceView offers a comprehensive annotation of human, mouse and nematode genes
 reconstructed by co-alignment and clustering of all publicly available
 mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable
 up-to-date resource on the genes, their functions, alternative variants,
 expression, regulation and interactions, in the hope to stimulate
 further validating experiments at the bench
">


<meta name="author"
 content="Danielle Thierry-Mieg and Jean Thierry-Mieg,
 NCBI/NLM/NIH, mieg@ncbi.nlm.nih.gov">




   <!--
    var myurl="av.cgi?db=mouse" ;
    var db="mouse" ;
    var doSwf="s" ;
    var classe="gene" ;
  //-->


But I am stuck on the logic of the following script. What is the right way to achieve this?

   #!/usr/bin/perl -w

   my  $INFILE_file_name = $file;      # input file name

    open ( INFILE, '<', $INFILE_file_name )
        or croak "$0 : failed to open input file $INFILE_file_name : $!\n";


    my @allsum;

    while ( <INFILE> ) {
        chomp;

        my $line = $_;

        my @temp1 = ();
        if ( $line =~ /^ AceView summary/ ) {
            print "$line\n";
            push @temp1, $line;
        }
        elsif( $line =~ /Please quote/) {
            push @allsum, [@temp1];
             @temp1 = ();
        }
        elsif ($line =~ /The closest human gene/) {

            push @allsum, $line;
        }

    }

    close ( INFILE );           # close input file
    # Do something with @allsum


I need to process many files.

Answer 1

Eug*_*ash 5

You can use the range operator in scalar context to extract the whole paragraph:

while (<INFILE>) {
    chomp;
    if (/AceView summary/ .. /Please quote/) {
        print "$_\n";
    }

    print "$_\n" if /^The closest human gene/;
}
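
The loop above prints the matches, and it also prints the "Please quote" line itself. To collect the two pieces into a two-element array and leave the terminating line out, one possible sketch follows. It relies on the documented behaviour that the range operator returns a sequence number in scalar context, with "E0" appended on the last line of the range. The helper name `extract_summary` and the use of `@ARGV` for the file list are assumptions made for illustration; they are not part of the original question or answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): reads an open
# filehandle and returns a two-element list:
#   (summary paragraph as one string, "The closest human gene" line).
sub extract_summary {
    my ($fh) = @_;
    my (@para, $closest);

    while (my $line = <$fh>) {
        chomp $line;

        # Scalar-context range operator: false outside the range,
        # 1, 2, 3, ... inside it, and "<n>E0" on the line that ends it.
        my $pos = ($line =~ /AceView summary/) .. ($line =~ /^Please quote/);

        # Keep lines inside the range, but skip the terminating line.
        push @para, $line if $pos && $pos !~ /E0$/;

        # Remember the first line that starts with the marker text.
        $closest = $line
            if !defined $closest && $line =~ /^The closest human gene/;
    }
    return (join("\n", @para), $closest);
}

# Usage: process every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file
        or die "$0: failed to open input file $file: $!\n";
    my @allsum = extract_summary($fh);    # exactly two elements
    close $fh;

    print "== $file ==\n";
    print "$allsum[0]\n" if length $allsum[0];
    print "$allsum[1]\n" if defined $allsum[1];
}
```

One caveat when processing many files: the range operator keeps its on/off state between calls, so if a file ends before a "Please quote" line appears, the range would still be open when the next file starts. If that can happen in your data, an explicit flag variable reset per file is probably the safer choice.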

