Perl UTF-8 encoding error: neither LWP::UserAgent->decoded_content nor Encode::decode works. Any other ideas?

Question

Mon*_*sto 4 perl decode utf-8 character-encoding lwp-useragent

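I'm having an encoding problem in Perl when I try to pull addresses from around the world out of web pages, using LWP::UserAgent to fetch them and Encode for the character encoding. I've tried googling for solutions, but nothing seems to work. I'm using Strawberry Perl 5.12.3.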

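Take the address page of the US embassy in the Czech Republic as an example (http://prague.usembassy.gov/contact.html). All I want is to pull back the address:

Address: Tržiště 15, 118 01 Praha 1 - Malá Strana, Czech Republic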

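Firefox displays this correctly using the UTF-8 character encoding, which is the same charset the web page declares in its header. But when I try to pull it back with Perl and write it to a file, the encoding still looks messed up, whether I use decoded_content from LWP::UserAgent or Encode::decode.

I've tried running a regex over the data to check that the error isn't introduced when the data is printed (i.e. that the data is correct internally in Perl), but the error seems to lie in how Perl handles the encoding.

Here is my code: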

#!/usr/bin/perl

require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";

$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for #-digit character in the strings (to guard against the error coming in the print statement) 
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;

Here is the output to the terminal:

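CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢15
118 01 Praha 1 - MaláStrana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢15
118 01 Praha 1 - MaláStrana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢15
118 01 Praha 1 - MaláStrana
Czech Republic
DOUBLE-DECODED CONTENT::
Tr┼╛išt─¢15
118 01 Praha 1 - MaláStrana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR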

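And to the file (note that this differs slightly from the terminal output but is still incorrect). OK, WOW: this displays correctly on Stack Overflow, but not in Bluefish, LibreOffice, Excel, Word, or anything else on my computer. So the data is there, just incorrectly encoded. I really don't know what's going on.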

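CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TrÅ¾ištÄ15
118 01 Praha 1 - MaláStrana
Czech Republic
DECODED CONTENT::
Tržiště15
118 01 Praha 1 - MaláStrana
Czech Republic
ENCODE::DECODED CONTENT::
Tržiště15
118 01 Praha 1 - MaláStrana
Czech Republic
DOUBLE-DECODED CONTENT::
Tržiště15
118 01 Praha 1 - MaláStrana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR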

Any pointers on how to do this would be really appreciated.

Thanks, Ian/Montecristo

Answer 1

dax*_*xim 5

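The mistake is using regular expressions to parse HTML. At the very least, you are missing the step of decoding HTML entities. You can do that manually, or leave it to a robust parser: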

use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
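
For the manual route mentioned above, a minimal sketch follows. It assumes HTML::Entities (from the HTML-Parser distribution, which LWP depends on) is installed. The hard-coded byte string is a hypothetical stand-in for `$ua_response->content`, so the markup and the `&scaron;` / `&aacute;` entities are illustrative rather than copied from the real page.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(decode_entities);

# Hypothetical stand-in for $ua_response->content: raw UTF-8 bytes that
# also contain HTML entities. The markup here is illustrative only.
my $bytes = "<p><b>Address:</b> Tr\xC5\xBEi&scaron;t\xC4\x9B 15, Mal&aacute; Strana</p>";

# Step 1: bytes -> characters (roughly what decoded_content does for you).
my $html = decode('UTF-8', $bytes);

# Step 2: entities -> characters (the step the answer says is missing).
my $text = decode_entities($html);

# Step 3: strip tags crudely for this demo; a real parser is more robust.
$text =~ s/<[^>]+>//g;

# Step 4: characters -> bytes exactly once, on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The point of the ordering is that every string is decoded once on the way in and encoded once on the way out; the `/\&/` check in the question would then no longer fire, because the entities have been replaced by the characters they name.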

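@Montecristo, try moving on, and you will find that Perl's Unicode support is the most advanced and powerful. Simply put, use 5.14. I took the long path perl -> python -> ruby -> perl. (A waste of time.) (6 upvotes)
@Montecristo, the only "problem" with Perl's Unicode support is that Perl does it right, and when you do it right there are no shortcuts. Many languages have shortcuts, so on first use they seem easier, but later you discover their limits. Simply put, Unicode is a complicated thing, and Perl must maintain backward compatibility with 20k+ CPAN modules and so on. That is why things seem complicated at the start. Unfortunately, if you want to write correct Unicode programs, you simply need to understand what Unicode is. Read tchrist's famous post: http://stackoverflow.com/a/6163129/632407 (2 upvotes)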
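
The bytes-versus-characters distinction these comments point at can be illustrated with a short sketch. It uses only the core Encode module; the sample string and variable names are hypothetical. It shows how encoding bytes that were never decoded would produce the `TrÅ¾` seen on the UNDECODED line of the file output in the question, where raw content is printed through a `:utf8` layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Trž", as they might arrive off the wire (illustrative).
my $bytes = "Tr\xC5\xBE";

# Decoding turns 4 bytes into 3 characters.
my $chars = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding the decoded string gives back the original bytes.
my $good = encode('UTF-8', $chars);

# Encoding the *undecoded* bytes treats each byte as a Latin-1 character
# and encodes it a second time: this is double encoding.
my $double = encode('UTF-8', $bytes);

# Print both results as hex so the terminal's own encoding cannot interfere.
printf "correct output: %s\n", join ' ', map { sprintf '%02X', ord } split //, $good;
printf "double-encoded: %s\n", join ' ', map { sprintf '%02X', ord } split //, $double;
```

Read in a UTF-8 aware editor, the double-encoded bytes `C3 85 C2 BE` display as `Å¾`, while the correct bytes `C5 BE` display as `ž`.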
