hlo*_*dal 12 unicode perl json utf-8
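Related to this question and this answer (to another question), I am still unable to process UTF-8 with JSON.
I have tried to make sure all the required voodoo is invoked, based on recommendations from the very best experts, and as far as I can see the string is as valid, marked and labelled as UTF-8 as possible. But perl still dies with either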
Uncaught exception: malformed UTF-8 character in JSON string
or
Uncaught exception: Wide character in subroutine entry
What am I doing wrong here?
(hlovdal) localhost:/work/2011/perl_unicode>cat json_malformed_utf8.pl
#!/usr/bin/perl -w -CSAD
### BEGIN ###
# Apparently the very best perl unicode boiler template code that exist,
# https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
# Slightly modified.
use v5.12; # minimal for unicode string feature
#use v5.14; # optimal for unicode string feature
use utf8; # Declare that this source unit is encoded as UTF-8. Although
# once upon a time this pragma did other things, it now serves
# this one singular purpose alone and no other.
use strict;
use autodie;
use warnings; # Enable warnings, since the previous declaration only enables
use warnings qw< FATAL utf8 >; # strictures and features, not warnings. I also suggest
# promoting Unicode warnings into exceptions, so use both
# these lines, not just one of them.
use open qw( :encoding(UTF-8) :std ); # Declare that anything that opens a filehandles within this
# lexical scope but not elsewhere is to assume that that
# stream is encoded in UTF-8 unless you tell it otherwise.
# That way you do not affect other module’s or other program’s code.
use charnames qw< :full >; # Enable named characters via \N{CHARNAME}.
use feature qw< unicode_strings >;
use Carp qw< carp croak confess cluck >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;
END { close STDOUT }
if (grep /\P{ASCII}/ => @ARGV) {
@ARGV = map { decode("UTF-8", $_) } @ARGV;
}
$| = 1;
binmode(DATA, ":encoding(UTF-8)"); # If you have a DATA handle, you must explicitly set its encoding.
# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
confess "Uncaught exception: @_" unless $^S;
};
# now promote run-time warnings into stackdumped exceptions
# *unless* we're in an try block, in which
# case just generate a clucking stackdump instead
local $SIG{__WARN__} = sub {
if ($^S) { cluck "Trapped warning: @_" }
else { confess "Deadly warning: @_" }
};
### END ###
use JSON;
use Encode;
use Getopt::Long;
use Encode;
my $use_nfd = 0;
my $use_water = 0;
GetOptions("nfd" => \$use_nfd, "water" => \$use_water );
print "JSON->backend->is_pp = ", JSON->backend->is_pp, ", JSON->backend->is_xs = ", JSON->backend->is_xs, "\n";
sub check {
my $text = shift;
return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}
my $json_text = "{ \"my_test\" : \"hei på deg\" }\n";
if ($use_water) {
$json_text = "{ \"water\" : \"水\" }\n";
}
if ($use_nfd) {
$json_text = NFD($json_text);
}
print check($json_text), "\$json_text = $json_text";
# test from perluniintro(1)
if (eval { decode_utf8($json_text, Encode::FB_CROAK); 1 }) {
print "string is valid utf8\n";
} else {
print "string is not valid utf8\n";
}
my $hash_ref1 = JSON->new->utf8->decode($json_text);
my $hash_ref2 = decode_json( $json_text );
__END__
Running this gives
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei på deg" }
string is valid utf8
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl | ./uniquote
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei p\N{U+E5} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -nfd | ./uniquote
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei pa\N{U+30A} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "水" }
string is valid utf8
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water | ./uniquote
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water --nfd | ./uniquote
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
at ./json_malformed_utf8.pl line 46
main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>rpm -q perl perl-JSON perl-JSON-XS
perl-5.12.4-159.fc15.x86_64
perl-JSON-2.51-1.fc15.noarch
perl-JSON-XS-2.30-2.fc15.x86_64
(hlovdal) localhost:/work/2011/perl_unicode>
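uniquote is from http://training.perl.com/scripts/uniquote
Update:
Thanks to brian for highlighting the solution. Updating the source to use json_text for all normal strings and json_bytes for what is passed to JSON, as follows, now works as expected: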
my $json_bytes = encode('UTF-8', $json_text);
my $hash_ref1 = JSON->new->utf8->decode($json_bytes);
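I must say that I think the documentation for the JSON module is extremely unclear and partially misleading.
The word "text" (at least to me) implies a string of characters. So when reading $perl_scalar = decode_json $json_text, I expect json_text to be a UTF-8 encoded string of characters. Thoroughly re-reading the documentation, knowing what to look for, I now see that it says: "decode_json ... expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text", but in my opinion that is still not clear.
My background is in a language with some additional non-ASCII characters, and I remember the days when you had to guess which code page was in use and email would cripple text by stripping the 8th bit. Back then, "binary" in the context of strings meant a string containing characters outside the 7-bit ASCII domain. But what is "binary", really? Aren't all strings binary at the core level?
The documentation also says "simple and fast interfaces (expect/generate UTF-8)" and "correct unicode handling" (the first point under "Features"), both without mentioning anywhere nearby that it does not want a character string but a byte sequence. I will ask the author to at least make this clearer.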
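The distinction the documentation draws can be sketched roughly as follows. This is a minimal sketch, not the module author's recommended usage: it assumes JSON::PP (a core module in recent perls) as a stand-in for the JSON/JSON::XS pair used above, and the variable names are made up for illustration. A decoder with ->utf8 enabled wants UTF-8 encoded bytes, while a decoder without it wants a character string.

```perl
use strict;
use warnings;
use Encode qw( encode );
use JSON::PP;    # assumption: core stand-in for JSON / JSON::XS

# A character string. U+E5 is written as an escape so the sketch
# does not depend on the encoding of this source file.
my $json_chars = qq{{ "my_test" : "hei p\N{U+E5} deg" }};

# Option 1: encode to UTF-8 bytes and use a decoder with ->utf8 enabled.
my $json_bytes = encode('UTF-8', $json_chars);
my $from_bytes = JSON::PP->new->utf8->decode($json_bytes);

# Option 2: keep the characters and use a decoder without ->utf8.
my $from_chars = JSON::PP->new->decode($json_chars);

# Either way the decoded value is a character string.
my $same = $from_bytes->{my_test} eq $from_chars->{my_test};
print "same result: ", ($same ? "yes" : "no"), "\n";
```

Which option fits depends on where the JSON comes from: bytes read from a socket or a raw file suit the first, text that has already been decoded suits the second.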
bri*_*foy 14
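I expand on my answer in "Know the difference between character strings and UTF-8 strings".
From reading the JSON docs, I think those functions don't want a character string, but that's what you're trying to give them. Instead, they want a "UTF-8 binary string". That seems odd to me, but I'm guessing it's mostly meant to take input directly from an HTTP message rather than something you type directly into your program. This works because I make a byte string that is the UTF-8 encoded version of your string: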
use v5.14;
use utf8;
use warnings;
use feature qw< unicode_strings >;
use Data::Dumper;
use Devel::Peek;
use JSON;
my $filename = 'hei.txt';
my $char_string = qq( { "my_test" : "hei på deg" } );
open my $fh, '>:encoding(UTF-8)', $filename;
print $fh $char_string;
close $fh;
{
say '=' x 70;
my $byte_string = qq( { "my_test" : "hei p\303\245 deg" } );
print "Byte string peek:------\n"; Dump( $byte_string );
decode( $byte_string );
}
{
say '=' x 70;
my $raw_string = do {
open my $fh, '<:raw', $filename;
local $/; <$fh>;
};
print "raw string peek:------\n"; Dump( $raw_string );
decode( $raw_string );
}
{
say '=' x 70;
my $char_string = do {
open my $fh, '<:encoding(UTF-8)', $filename;
local $/; <$fh>;
};
print "char string peek:------\n"; Dump( $char_string );
decode( $char_string );
}
sub decode {
my $string = shift;
my $hash_ref2 = eval { decode_json( $string ) };
say "Error in sub form: $@" if $@;
print Dumper( $hash_ref2 );
my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
say "Error in method form: $@" if $@;
print Dumper( $hash_ref1 );
}
The output shows that the character string doesn't work, but the byte string versions do:
======================================================================
Byte string peek:------
SV = PV(0x100801190) at 0x10089d690
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x100209890 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
CUR = 31
LEN = 32
$VAR1 = {
'my_test' => "hei p\x{e5} deg"
};
$VAR1 = {
'my_test' => "hei p\x{e5} deg"
};
======================================================================
raw string peek:------
SV = PV(0x100839240) at 0x10089d780
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x100212260 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
CUR = 31
LEN = 32
$VAR1 = {
'my_test' => "hei p\x{e5} deg"
};
$VAR1 = {
'my_test' => "hei p\x{e5} deg"
};
======================================================================
char string peek:------
SV = PV(0x10088f3b0) at 0x10089d840
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1002017b0 " { \"my_test\" : \"hei p\303\245 deg\" } "\0 [UTF8 " { "my_test" : "hei p\x{e5} deg" } "]
CUR = 31
LEN = 32
Error in sub form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 51.
$VAR1 = undef;
Error in method form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 55.
$VAR1 = undef;
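The difference between the two kinds of string in the dumps above can also be seen without JSON at all. The following is a small sketch using only Encode, with variable names chosen for illustration: the character string counts code points, while its UTF-8 encoding counts bytes, and U+E5 takes two bytes (\303\245).

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Character string: U+E5 is a single character.
my $chars = "hei p\N{U+E5} deg";

# Byte string: the same text encoded as UTF-8, where U+E5 becomes \xC3\xA5.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# Decoding the bytes gives the original character string back.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok: ", ($round_trip eq $chars ? "yes" : "no"), "\n";
```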
So, if you take your character string, which you typed directly into your program, and convert it to a UTF-8 encoded byte string, it works:
use v5.14;
use utf8;
use warnings;
use feature qw< unicode_strings >;
use Data::Dumper;
use Encode qw(encode_utf8);
use JSON;
my $char_string = qq( { "my_test" : "hei på deg" } );
my $string = encode_utf8( $char_string );
decode( $string );
sub decode {
my $string = shift;
my $hash_ref2 = eval { decode_json( $string ) };
say "Error in sub form: $@" if $@;
print Dumper( $hash_ref2 );
my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
say "Error in method form: $@" if $@;
print Dumper( $hash_ref1 );
}
I think JSON should be smart enough to deal with this so you don't have to think at this level, but that's the way it is (so far).
The docs say
$perl_hash_or_arrayref = decode_json $utf8_encoded_json_text;
Yet you do everything in your power to decode the input before passing it to decode_json.
use strict;
use warnings;
use utf8;
use Data::Dumper qw( Dumper );
use Encode qw( encode );
use JSON qw( );
for my $json_text (
qq{{ "my_test" : "hei på deg" }\n},
qq{{ "water" : "水" }\n},
) {
my $json_utf8 = encode('UTF-8', $json_text); # Counteract "use utf8;"
my $data = JSON->new->utf8->decode($json_utf8);
local $Data::Dumper::Useqq = 1;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Indent = 0;
print(Dumper($data), "\n");
}
Output:
{"my_test" => "hei p\x{e5} deg"}
{"water" => "\x{6c34}"}
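To see the failure and the fix side by side, the sketch below feeds the same JSON text to a ->utf8 decoder twice, once as characters and once as encoded bytes. It assumes JSON::PP as a core stand-in for the JSON/JSON::XS modules discussed here, so the exact error text may differ from the messages quoted in the question.

```perl
use strict;
use warnings;
use Encode qw( encode );
use JSON::PP;    # assumption: core stand-in for JSON / JSON::XS

# Character string containing U+6C34 (the "water" character).
my $json_chars = qq{{ "water" : "\N{U+6C34}" }};
my $decoder    = JSON::PP->new->utf8;

# Passing characters to a ->utf8 decoder is expected to die.
my $data  = eval { $decoder->decode($json_chars) };
my $error = $@;
print "character string rejected: ", ($error ? "yes" : "no"), "\n";

# Encoding first gives the decoder the UTF-8 bytes it asks for.
my $ok = $decoder->decode( encode('UTF-8', $json_chars) );
printf "decoded code point: U+%04X\n", ord $ok->{water};
```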
PS - It would be easier to help you if you didn't have two pages of code to demonstrate a simple problem.