Parsing a large HTML file (local) with Perl or PHP

zer*_*ero 0 php perl text-parsing

I have a large document. I need to parse it and output only this part: schule.php?schulnr=80287&lschb=

How do I parse this?

<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>

Looking forward to hearing from you.

Chr*_*ris 5

You should use a DOM parser such as Simple HTML DOM Parser.

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href . '<br>';
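For the specific link in the question, the same idea can be sketched without a third-party library. Note that this swaps in PHP's bundled DOM extension (DOMDocument) in place of Simple HTML DOM; the helper name findSchuleLinks and the inline sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension, not Simple HTML DOM.
// findSchuleLinks() is a hypothetical helper name chosen for this example.
function findSchuleLinks(string $html, string $needle = 'schule.php?'): array
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    // The HTML parser lowercases tag and attribute names, so <A HREF=...> matches too.
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup taken from the question.
$sample = '<td><A HREF="schule.php?schulnr=80287&lschb=" target="_blank">'
        . '<center><img border=0 height=16 width=15 src="sh_info.gif"></center></A></td>';

foreach (findSchuleLinks($sample) as $href) {
    echo $href, PHP_EOL;
}
```

For a local file, the string could come from file_get_contents(), or the document could be loaded directly with DOMDocument::loadHTMLFile(). Either way the whole tree is built in memory, which may matter for a very large document.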


Axe*_*man 5

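In Perl, the quickest and best way I know to scan HTML is HTML::PullParser. It is built on a robust HTML parser rather than on a simple FSA such as Perl regular expressions (without recursion).

It works more like a SAX filter than a DOM.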

use 5.010;
use constant NOT_FOUND => -1;
use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::PullParser ();

my $pp 
    = HTML::PullParser->new(
      # your file or even a handle
      file        => 'my.html'
      # specifies that you want a tuple of tagname, attribute hash
    , start       => 'tag, attr' 
      # you only want to look at tags with tagname = 'a'
    , report_tags => [ 'a' ],
    ) 
    or die "$OS_ERROR"
    ;

my $anchor_url;
while ( defined( my $t = $pp->get_token )) { 
    next unless ref $t and $t->[0] eq 'a'; # this shouldn't happen, really
    my $href = $t->[1]->{href};
    next unless defined $href;             # skip anchors that have no href
    if ( index( $href, 'schule.php?' ) > NOT_FOUND ) { 
        $anchor_url = $href;
        last;
    }
}

# print the match, if one was found
say $anchor_url if defined $anchor_url;