w.k*_*w.k 0 perl utf-8 html-parsing html-tree
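I wrote a script in which I slurp in a UTF-8 encoded HTML file and then parse it into a tree with HTML::Tree. The problem is that after parsing, the strings are no longer marked as UTF-8.
Since _utf8_on() is not the recommended way to set the flag, I am looking for the proper way.
My simplified code example: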
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8::all;
use autodie;
use HTML::Tree;
use Encode qw/is_utf8/;

my $file = shift;
my $tree;
if ($file) {
    my $content = slurp_in( 'file' => $file );
    $tree = html_tree('content' => $content);
} else {
    die "no file";
}

my $title = $tree->look_down(_tag => 'title');
$title = $title->as_HTML('');
if ( is_utf8( $title ) ) {
    say "OK: $title";
} else {
    say "NOT OK: $title";
}

## SUBS
##
sub slurp_in {
    my %v = @_;
    open(my $fh, "<:utf8", $v{file}) || die "no $v{file}: $!";
    local $/;
    my $content = (<$fh>);
    close $fh;
    if ($content) {
        return $content;
    } else {
        die "no content in $v{file} !";
    }
}

sub html_tree {
    my %v = @_;
    my $tree = HTML::Tree->new();
    $tree->utf8_mode(1); ## wrong call here, no such method, but no warnings on it!
    $tree->parse( $v{content} );
    if ($tree) {
        return $tree;
    } else {
        die "no tree here";
    }
}
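Your code is overcomplicated: you employ utf8::all, decode manually, and call that strange method, all at once. Rhetorically asking, what do you expect to achieve that way? I do not have the patience to find out the details of what goes wrong and where, especially since you did not post any input for which your program fails to do the expected thing, so I drastically reduced it to a much simpler one. This works: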
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings FATAL => ':all';
use File::Slurp qw(read_file); # autodies on error
use HTML::Tree qw();
my $file = shift;
die 'no file' unless $file;
my $tree = HTML::Tree->new_from_content(
    read_file($file, binmode => ':encoding(UTF-8)')
);
my $title = $tree->look_down(_tag => 'title');
$title->as_HTML(''); # returns a Perl string
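To check the same idea without a file on disk, here is a minimal sketch that decodes raw UTF-8 bytes once at input, parses the resulting character string, and encodes again only at output. It assumes HTML::TreeBuilder (shipped with the HTML::Tree distribution) is installed; the helper name `title_from_utf8_bytes` and the sample markup are made up for illustration and are not part of the original question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder;

# Hypothetical helper (not from the question): takes raw UTF-8 bytes
# and returns the <title> text as a Perl character string.
sub title_from_utf8_bytes {
    my ($bytes) = @_;

    # Decode exactly once, at the input boundary: bytes -> characters.
    my $chars = decode('UTF-8', $bytes);

    # Parse the decoded string; utf8_mode is left at its default (off).
    my $tree  = HTML::TreeBuilder->new_from_content($chars);
    my $title = $tree->look_down(_tag => 'title');
    my $text  = defined $title ? $title->as_text : undef;

    $tree->delete;    # release the tree explicitly
    return $text;
}

# Sample input: "Gruesse" spelled with u-umlaut and sharp s, as UTF-8 bytes.
my $bytes = "<html><head><title>Gr\xC3\xBC\xC3\x9Fe</title></head>"
          . "<body></body></html>";
my $text = title_from_utf8_bytes($bytes);

# Encode exactly once, at the output boundary: characters -> bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "title: $text\n";
print "characters: ", length($text), "\n";
```

The sketch checks `length` rather than `is_utf8`, because whether a string holds the intended characters matters more than Perl's internal flag. If the decode step were skipped, the parser would see seven bytes in the title instead of five characters.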