使用JSON库处理时会破坏UTF-8字符(这可能类似于在perl中解码unicode JSON的问题,但设置binmode只会产生另一个问题).
我已将问题减少到以下示例:
(hlovdal) localhost:/tmp/my_test>cat my_test.pl
#!/usr/bin/perl -w
use strict;
use warnings;
use JSON;
use File::Slurp;
use Getopt::Long;
use Encode;
my $set_binmode = 0;
GetOptions("set-binmode" => \$set_binmode);
if ($set_binmode) {
binmode(STDIN, ":encoding(UTF-8)");
binmode(STDOUT, ":encoding(UTF-8)");
binmode(STDERR, ":encoding(UTF-8)");
}
sub check {
my $text = shift;
return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}
my $my_test = "hei på deg";
my $json_text = read_file('my_test.json');
my $hash_ref = JSON->new->utf8->decode($json_text);
print check($my_test), "\$my_test = $my_test\n";
print check($json_text), "\$json_text = $json_text";
print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} . "\n";
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)
在运行测试时,文本由于某种原因被压缩到iso-8859-1中.设置binmode排序可以解决它,但随后会导致其他字符串的双重编码.
(hlovdal) localhost:/tmp/my_test>cat my_test.json
{ "my_test" : "hei på deg" }
(hlovdal) localhost:/tmp/my_test>file my_test.json
my_test.json: UTF-8 Unicode text
(hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json
0000000 { " m y _ t e s t " : " h
0000010 e i p 303 245 d e g " } \n
000001e
(hlovdal) localhost:/tmp/my_test>
(hlovdal) localhost:/tmp/my_test>perl my_test.pl
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p? deg
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)
是什么导致了这个以及如何解决?
这是一个新安装的和最新的Fedora 15系统.
(hlovdal) localhost:/tmp/my_test>perl --version | grep version
This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi
(hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON
perl-JSON-2.51-1.fc15.noarch
(hlovdal) localhost:/tmp/my_test>locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)
更新:添加use utf8没有解决它,字符仍然没有正确处理(虽然与以前略有不同):
(hlovdal) localhost:/tmp/my_test>perl my_test.pl
is_utf8(): 1, is_utf8(1): 1. $my_test = hei p? deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p? deg
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
is_utf8(): 1, is_utf8(1): 1. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)
我可以在Perl源中使用Unicode吗?
是的你可以!如果您的源是UTF-8编码,您可以使用utf8 pragma指示.
Run Code Online (Sandbox Code Playgroud)use utf8;这对您的输入或输出无效.它只会影响您的源读取方式.您可以在字符串文字中使用Unicode,在标识符中(但根据\ w,它们仍然必须是"单词字符"),甚至在自定义分隔符中也是如此.
你用UTF-8保存了程序,但忘了告诉Perl.添加use utf8;.
而且,你编程太复杂了.JSON函数DWYM.要检查内容,请使用Devel :: Peek.
use utf8; # for the following line
my $my_test = 'hei på deg';
use Devel::Peek qw(Dump);
use File::Slurp (read_file);
use JSON qw(decode_json);
my $hash_ref = decode_json(read_file('my_test.json'));
Dump $hash_ref; # Perl character strings
Dump $my_test; # Perl character string
Run Code Online (Sandbox Code Playgroud)