And*_*wby 1 perl mojolicious mojo-useragent
I'm trying to work out how to use Mojo::DOM with UTF8 (and other formats, not just UTF8). It seems to mess up the encoding:
my $dom = Mojo::DOM->new($html);
$dom->find('script')->reverse->each(sub {
#print "$_->{id}\n";
$_->remove;
});
$dom->find('style')->reverse->each(sub {
#print "$_->{id}\n";
$_->remove;
});
$dom->find('script')->reverse->each(sub {
#print "$_->{id}\n";
$_->remove;
});
my $html = "$dom"; # pass back to $html, now we have cleaned it up...
This is what I get when I save the file without running it through Mojo:
...and then after one pass through Mojo:
FWIW, I'm grabbing the HTML file using Path::Tiny:
my $utf8 = path($_[0])->slurp_raw;
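As I understand it, that should already have decoded the string into bytes ready for Mojo to use?
Update: After brian's suggestion, I looked into how to detect the encoding so that I can decode the file correctly. I tried Encode::Guess and a few other approaches, but they seemed to get it wrong in a lot of cases. This seems to do the trick: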
my $enc_tmp = `encguess $_[0]`;
my ($fname,$type) = split /\s+/, $enc_tmp;
my $decoded = decode( $type||"UTF-8", path($_[0])->slurp_raw );
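For comparison, here is a minimal sketch of a different technique that avoids shelling out: try a strict UTF-8 decode first and fall back to a legacy encoding only if that fails. The helper name decode_with_fallback and the cp1252 default are assumptions for illustration; pick whatever fallback matches the files you actually have.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, then an assumed legacy fallback.
sub decode_with_fallback {
    my ($octets, $fallback) = @_;
    $fallback //= 'cp1252';    # assumption: adjust to your data

    # FB_CROAK makes invalid UTF-8 die instead of being silently replaced;
    # LEAVE_SRC keeps $octets intact so the fallback can still read it.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars if defined $chars;

    return decode($fallback, $octets);
}

# Valid UTF-8 octets and a lone cp1252 byte both decode to one character.
my $from_utf8   = decode_with_fallback("\xC2\xA9");
my $from_legacy = decode_with_fallback("\xA9");
printf "U+%04X U+%04X\n", ord $from_utf8, ord $from_legacy;
```

This only distinguishes "valid UTF-8" from "everything else", so it is not a general detector; it simply avoids guessing when the input is well-formed UTF-8.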
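You are reading the raw octets but never decoding them (you store the raw data in $utf8). You then treat that data as if it had already been decoded, so the result is mojibake.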
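To make the mechanism concrete, here is a minimal sketch using only the core Encode module. The sample string and variable names are illustrative assumptions, not taken from the question's file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets, as slurp_raw would return them: "\xC2\xA9" is the copyright sign.
my $octets = "Copyright \xC2\xA9 2022";

# Wrong: treat the octets as characters and encode them again on output.
# Each of the two bytes is encoded separately, which is the mojibake.
my $mojibake = encode('UTF-8', $octets);

# Right: decode to characters first, work with them, then encode on output.
my $chars     = decode('UTF-8', $octets);
my $roundtrip = encode('UTF-8', $chars);

printf "octets: %d bytes, characters: %d, mojibake: %d bytes\n",
    length $octets, length $chars, length $mojibake;
print "round trip preserved the octets\n" if $roundtrip eq $octets;
```

The two extra bytes in the double-encoded string are what show up on screen as the stray "Â" in front of the copyright sign.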
slurp_utf8 will decode for you. The open pragma takes care of encoding the output in this example. Mojo::File->slurp gets you the raw octets, so you can reduce your dependency list.
use v5.10;
use utf8;

use open qw(:std :utf8);
use Path::Tiny;
use Mojo::File;
use Mojo::Util qw(decode);

my $filename = 'test.txt';
open my $fh, '>:encoding(UTF-8)', $filename;
say { $fh } "Copyright © 2022";
close $fh;

my $octets = path($filename)->slurp_utf8;

say "===== Path::Tiny::slurp_raw, no decode";
say path($filename)->slurp_raw;

say "===== Path::Tiny::slurp_raw, decode";
say decode( 'UTF-8', path($filename)->slurp_raw );

say "===== Path::Tiny::slurp_utf8";
say path($filename)->slurp_utf8;

say "===== Mojo::File::slurp, decode";
say decode( 'UTF-8', Mojo::File->new($filename)->slurp );
Output:

===== Path::Tiny::slurp_raw, no decode
Copyright Â© 2022

===== Path::Tiny::slurp_raw, decode
Copyright © 2022

===== Path::Tiny::slurp_utf8
Copyright © 2022

===== Mojo::File::slurp, decode
Copyright © 2022
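Applied to the cleanup code from the question, the same decode, process, encode pattern might look like the sketch below. The helper name strip_scripts_and_styles and the sample markup are assumptions for illustration, and the input is assumed to be UTF-8; the sketch works on an in-memory string so it does not depend on any particular file API.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(decode encode);

# Hypothetical helper: takes raw UTF-8 octets, returns cleaned raw UTF-8 octets.
sub strip_scripts_and_styles {
    my ($octets) = @_;

    # Mojo::Util::decode returns undef when the octets are not valid UTF-8.
    my $chars = decode('UTF-8', $octets);
    die "input is not valid UTF-8\n" unless defined $chars;

    # Mojo::DOM works on characters, not octets.
    my $dom = Mojo::DOM->new($chars);
    $dom->find('script, style')->each(sub { $_->remove });

    # Encode again before the result leaves the program.
    return encode('UTF-8', $dom->to_string);
}

my $in  = "<p>Copyright \xC2\xA9 2022</p><script>1</script><style>p{}</style>";
my $out = strip_scripts_and_styles($in);
print $out, "\n";
```

Because the function takes and returns octets, it can sit between slurp_raw (or Mojo::File->slurp) on the way in and a raw write on the way out without any extra I/O layers.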