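I want to check a site for links, and then recursively check the linked pages for links. But I don't want to fetch the same page twice. I'm having trouble with the logic. Here is the Perl code: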
my %urls_to_check = ();
my %checked_urls = ();
&fetch_and_parse($starting_url);
use Data::Dumper; die Dumper(\%checked_urls, \%urls_to_check);
sub fetch_and_parse {
    my ($url) = @_;
    if ($checked_urls{$url} > 1) { return 0; }
    warn "Fetching 'me' links from $url";
    my $p = HTML::TreeBuilder->new;
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req, sub { $p->parse($_[0])});
    $p->eof();
    my $base = $res->base;
    my @tags = $p->look_down(
        "_tag", "a",
    );
    foreach my $e (@tags) {
        my $full = url($e->attr('href'), $base)->abs;
        $urls_to_check{$full} = 1 if (!defined($checked_urls{$full}));
    }
    foreach my $url (keys %urls_to_check) {
        delete $urls_to_check{$url};
        $checked_urls{$url}++;
        &fetch_and_parse($url);
    }
}
But this doesn't really seem to do what I want.
Help?!
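Edit: I want to fetch the URLs from $starting_url, and then fetch any and all URLs found in the pages that result. However, if one of those URLs links back to $starting_url, I don't want to fetch it again.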
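To make the intended behaviour concrete, here is a minimal sketch of a fetch-once traversal. It is not the code above: it swaps the recursion for a queue and marks each URL as seen at the moment it is queued, so a link back to the start page is skipped. The `%site` hash, the example.com URLs, and the `crawl` helper are hypothetical stand-ins. In the real script, the `$get_links` callback would presumably do the LWP fetch and the HTML::TreeBuilder parse.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each URL maps to the links found on that page.
# This stands in for the real fetch-and-parse step so the sketch runs offline.
my %site = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/',  'http://example.com/b'],
    'http://example.com/b' => ['http://example.com/a'],
);

# crawl($start, $get_links) returns every reachable URL, each exactly once.
# $get_links is a code ref that takes a URL and returns the URLs it links to.
sub crawl {
    my ($start, $get_links) = @_;
    my %seen  = ($start => 1);   # mark a URL when it is queued, including the start
    my @queue = ($start);
    my @order;                   # order in which pages were "fetched"

    while (defined(my $url = shift @queue)) {
        push @order, $url;
        for my $link ($get_links->($url)) {
            next if $seen{$link}++;   # already queued or fetched: skip it
            push @queue, $link;
        }
    }
    return @order;
}

my @fetched = crawl('http://example.com/', sub { @{ $site{ $_[0] } || [] } });
print "$_\n" for @fetched;
```

With the sample data this prints the three URLs once each, starting with `http://example.com/`, even though both `/a` and `/b` link back to pages that were already queued.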