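I have an .html file full of links, and I want to extract the domains without the http:// (so just the host name part of each link, e.g. blah.com), list them, and remove duplicates.
This is what I have come up with so far. I think the problem is the way I am trying to pass the $tree data: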
#!/usr/local/bin/perl -w
use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI;
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    my $u1 = URI->new($tree);
    print "host: ", $u1->host, "\n";
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    # Now that we're done with it, we must destroy it.
    # $tree = $tree->delete; # Not required with weak references
}
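The attempt above hands the whole parse tree to URI->new, which expects a URL string rather than a tree. One possible approach, shown here as a minimal sketch rather than a definitive fix, is to walk the tree for a elements, build a URI from each href, and collect the hosts in a hash to drop duplicates. The helper name unique_hosts and the inline sample HTML are assumptions made for illustration; for real files, HTML::TreeBuilder->new_from_file($file_name) could replace new_from_content.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;   # weak references: no explicit ->delete needed
use URI;

# Hypothetical helper: return the unique host names of all links in an
# HTML string, in order of first appearance.
sub unique_hosts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my (%seen, @hosts);
    # href => qr/./ keeps only <a> elements that have a non-empty href.
    for my $anchor ($tree->look_down(_tag => 'a', href => qr/./)) {
        my $uri = URI->new($anchor->attr('href'));
        # Schemes such as mailto: have no host method; skip them.
        next unless $uri->can('host');
        my $host = $uri->host;
        # Relative links have no host either.
        next unless defined $host && length $host;
        push @hosts, $host unless $seen{$host}++;
    }
    return @hosts;
}

# Illustrative input only; not taken from the original question.
my $sample = '<a href="http://blah.com/a">1</a> '
           . '<a href="http://blah.com/b">2</a> '
           . '<a href="https://example.org/">3</a> '
           . '<a href="/local">4</a>';
print "$_\n" for unique_hosts($sample);
```

The %seen hash is what removes duplicates: a host is pushed onto the result list only the first time it is encountered.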
I am trying to match every tr whose class begins with the three letters eve or day. This is my attempt:
my @stuff = $p->look_down(
    _tag => 'tr',
    class => 'qr/eve*|day*/g'
);
foreach (@stuff) {
    print $_->as_text;
};
Just curious, what kind of items end up in @stuff?
Would this work? See below:
my @stuff = $p->look_down(
    _tag => 'tr',
    class => qr/eve.*|day.*/
);
print "\n\n";
foreach (@stuff) {
    print $_->as_text . "\n\n";
};
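Two details may matter here. In the first attempt the pattern is wrapped in single quotes, so look_down receives a plain string and compares the class attribute against it literally instead of treating it as a regex. In the second, the pattern is not anchored, so it also matches classes that merely contain eve or day somewhere. A minimal sketch with an anchored pattern follows; the helper name row_texts_by_class_prefix and the sample markup are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the text of every <tr> whose class
# attribute begins with "eve" or "day".
sub row_texts_by_class_prefix {
    my ($html) = @_;
    my $p = HTML::TreeBuilder->new_from_content($html);
    my @rows = $p->look_down(
        _tag  => 'tr',
        class => qr/^(?:eve|day)/,   # ^ anchors the match to the start
    );
    # Extract the text while the tree is still in scope.
    return map { $_->as_text } @rows;
}

# Illustrative input only; "seven" contains "eve" but does not start with it.
my $sample = '<table>'
           . '<tr class="even"><td>A</td></tr>'
           . '<tr class="seven"><td>B</td></tr>'
           . '<tr class="day1"><td>C</td></tr>'
           . '</table>';
print "$_\n" for row_texts_by_class_prefix($sample);
```

Each item that look_down returns is an HTML::Element object, which is why as_text can be called on it. If a row may carry several space-separated classes, a pattern such as qr/(?:^|\s)(?:eve|day)/ might be a better fit.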
I am trying to write a Perl script that opens an HTML file and extracts whatever is contained in the <span class="postertrip"> tags.
Example HTML:
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply2">
<a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername"><a href="test">Test1</a></span><span class="postertrip"><a href="test">!AAAAAAAA</a></span> 08/01/03(Thu)02:06</label> <span class="reflink"> <a href="test">No.2</a> </span> <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br /> <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>
<blockquote>
<p>Test message 1</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td …
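One way this extraction might be approached with HTML::TreeBuilder is sketched below: look_down selects each span whose class is exactly postertrip, and as_text returns the text it contains. The helper name poster_trips is an assumption, and the sample string is a shortened fragment modelled on the example HTML above. To read from a file, HTML::TreeBuilder->new_from_file could be used instead of new_from_content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the text inside every
# <span class="postertrip"> element of an HTML string.
sub poster_trips {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # A plain string value means the attribute must equal it exactly.
    return map { $_->as_text }
        $tree->look_down(_tag => 'span', class => 'postertrip');
}

# Shortened fragment based on the example markup; illustrative only.
my $sample = '<table><tbody><tr><td class="reply" id="reply2">'
           . '<span class="commentpostername"><a href="test">Test1</a></span>'
           . '<span class="postertrip"><a href="test">!AAAAAAAA</a></span>'
           . '</td></tr></tbody></table>';
print "$_\n" for poster_trips($sample);
```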