在perl中解码UTF-8 JSON的问题

hlo*_*dal 4 perl json utf-8

使用JSON库处理时会破坏UTF-8字符(这可能类似于在perl中解码unicode JSON的问题,但设置binmode只会产生另一个问题).

我已将问题减少到以下示例:

(hlovdal) localhost:/tmp/my_test>cat my_test.pl
#!/usr/bin/perl -w
use strict;
use warnings;
use JSON;
use File::Slurp;
use Getopt::Long;
use Encode;

my $set_binmode = 0;
GetOptions("set-binmode" => \$set_binmode);

if ($set_binmode) {
        binmode(STDIN,  ":encoding(UTF-8)");
        binmode(STDOUT, ":encoding(UTF-8)");
        binmode(STDERR, ":encoding(UTF-8)");
}

sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}

my $my_test = "hei på deg";
my $json_text = read_file('my_test.json');
my $hash_ref = JSON->new->utf8->decode($json_text);

print check($my_test), "\$my_test = $my_test\n";
print check($json_text), "\$json_text = $json_text";
print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} .  "\n";

(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)

在运行测试时,文本由于某种原因被压缩到iso-8859-1中.设置binmode排序可以解决它,但随后会导致其他字符串的双重编码.

(hlovdal) localhost:/tmp/my_test>cat my_test.json 
{ "my_test" : "hei på deg" }
(hlovdal) localhost:/tmp/my_test>file my_test.json 
my_test.json: UTF-8 Unicode text
(hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json 
0000000   {       "   m   y   _   t   e   s   t   "       :       "   h
0000010   e   i       p 303 245       d   e   g   "       }  \n        
000001e
(hlovdal) localhost:/tmp/my_test>
(hlovdal) localhost:/tmp/my_test>perl my_test.pl
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p? deg
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)

是什么导致了这个以及如何解决?


这是一个新安装的和最新的Fedora 15系统.

(hlovdal) localhost:/tmp/my_test>perl --version | grep version
This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi
(hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON
perl-JSON-2.51-1.fc15.noarch
(hlovdal) localhost:/tmp/my_test>locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)

更新:添加use utf8没有解决它,字符仍然没有正确处理(虽然与以前略有不同):

(hlovdal) localhost:/tmp/my_test>perl  my_test.pl
is_utf8(): 1, is_utf8(1): 1. $my_test = hei p? deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p? deg
(hlovdal) localhost:/tmp/my_test>perl  my_test.pl --set-binmode
is_utf8(): 1, is_utf8(1): 1. $my_test = hei på deg
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
(hlovdal) localhost:/tmp/my_test>
Run Code Online (Sandbox Code Playgroud)

perlunifaq所述

我可以在Perl源中使用Unicode吗?

是的你可以!如果您的源是UTF-8编码,您可以使用utf8 pragma指示.

use utf8;
Run Code Online (Sandbox Code Playgroud)

这对您的输入或输出无效.它只会影响您的源读取方式.您可以在字符串文字中使用Unicode,在标识符中(但根据\ w,它们仍然必须是"单词字符"),甚至在自定义分隔符中也是如此.

dax*_*xim 9

你用UTF-8保存了程序,但忘了告诉Perl.添加use utf8;.

而且,你编程太复杂了.JSON函数DWYM.要检查内容,请使用Devel :: Peek.

use utf8; # for the following line
my $my_test = 'hei på deg';

use Devel::Peek qw(Dump);
use File::Slurp (read_file);
use JSON qw(decode_json);

my $hash_ref = decode_json(read_file('my_test.json'));

Dump $hash_ref; # Perl character strings
Dump $my_test;  # Perl character string
Run Code Online (Sandbox Code Playgroud)

  • 将Perl字符串打印到STDOUT是错误的; 你必须首先将它们编码成八位字节,要么明确地使用编码qw(编码); print encode'UTF-8',$ my_test;`或implicitely`binmode STDOUT,':encoding(UTF-8)'; print $ my_test;`.此外,这不是ISO-8859-1,但Perl的内部编码由于历史原因而非常复杂,只是*发生*看起来像你说的那样.您需要严格介绍编码主题,请阅读http://p3rl.org/UNI (3认同)