use*_*810 5 perl uri html-treebuilder
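I have an .html file full of links, and I want to extract the domain names without the http:// (so just the hostname part of each link, e.g. blah.com), list them, and remove the duplicates.
This is what I have come up with so far. I think the problem is the way I am trying to pass the $tree data: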
#!/usr/local/bin/perl -w
use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI;
foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
my $u1 = URI->new($tree);
print "host: ", $u1->host, "\n";
print "Hey, here's a dump of the parse tree of $file_name:\n";
# Now that we're done with it, we must destroy it.
# $tree = $tree->delete; # Not required with weak references
}
小智 4
Personally, I would use Mojo::DOM for this, together with the URI module to extract the domain:
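use Mojo::DOM;
use URI;
use List::AllUtils qw/uniq/;
# $html_code is assumed to already hold the HTML source as a string
my @domains = sort +uniq
    # authority dies for schemes without one (e.g. mailto:); eval plus // () skips those
    map eval { URI->new( $_->{href} )->authority } // (),
    Mojo::DOM->new( $html_code )->find("a[href]")->each;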
(PS: the exception handling around ->authority is there because some URIs will croak at that call; mailto: links, for example.)
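For anyone who would rather stay with HTML::TreeBuilder as in the question: the likely reason the original attempt fails is that URI->new is handed the whole tree object instead of the href string of each link. Below is a minimal sketch of one way around that, assuming HTML::TreeBuilder 5 and URI are installed. The helper name hosts_in_html is made up for this example, and it uses ->host rather than ->authority so that ports and userinfo are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;   # weak references, so no explicit ->delete is needed
use URI;

# hosts_in_html is a hypothetical helper for this sketch: it returns the
# sorted, de-duplicated host names linked from one HTML string.
sub hosts_in_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my %seen;
    # look_down returns every <a> element that has a non-empty href attribute
    for my $anchor ( $tree->look_down( _tag => 'a', href => qr/./ ) ) {
        my $uri = URI->new( $anchor->attr('href') );
        # mailto: links and relative links have no host method, so skip them
        next unless $uri->can('host');
        my $host = $uri->host;
        $seen{ lc $host } = 1 if defined $host && length $host;
    }
    return sort keys %seen;
}

# Read each file named on the command line, as the script in the question does,
# and collect the hosts across all of them before printing.
my %all;
for my $file_name (@ARGV) {
    open my $fh, '<', $file_name or die "Cannot open $file_name: $!";
    my $html = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    $all{$_} = 1 for hosts_in_html($html);
}
print "$_\n" for sort keys %all;
```

Keeping the extraction in a function that takes a string makes it easy to try on a small snippet before pointing it at real files. Using a hash for de-duplication avoids the extra List::AllUtils dependency, at the cost of losing the original link order.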