Pal*_*lec 6 perl utf-8 character-encoding perl-io perl5
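For several hours now I have been fighting a bug in my Perl program. I am not sure whether I am doing something wrong or the interpreter is, but the code is non-deterministic where, in my opinion, it should be deterministic. It also shows the same behavior on an ancient Debian Lenny (Perl 5.10.0) and on a server just upgraded to Debian Wheezy (Perl 5.14.2). It boils down to this piece of Perl code: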
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
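It initializes the Perl 5 interpreter in strict mode with warnings enabled, with character strings (as opposed to byte strings) and with the named standard streams encoded in UTF8 (Perl's internal notion of UTF-8, but pretty close; changing to full UTF-8 makes no difference). Then it opens a file handle to an "in-memory file" (a scalar variable), prints a single two-byte UTF-8 character into it, and examines the variable after closing the handle.
The scalar variable now always has its UTF8 bit turned off. However, it sometimes contains a byte string (which utf8::decode() converts to a character string) and sometimes a character string that only needs its UTF8 bit turned on (Encode::_utf8_on()).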
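To make the byte-string versus character-string distinction concrete, here is a minimal sketch. It assumes a Perl whose utf8::decode behaves as documented, and it writes the two octets literally instead of taking them from the in-memory file; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Two octets: the UTF-8 encoding of U+0161 (š), as the in-memory file holds them.
my $octets = "\xC5\xA1";

# utf8::decode() works in place, so decode a copy and keep the octets.
my $chars   = $octets;
my $decoded = utf8::decode($chars);   # true when the octets were valid UTF-8

# Encode::_utf8_on() is an internal function: it only sets the UTF8 flag
# and does not validate the octets at all.
my $flipped = $octets;
Encode::_utf8_on($flipped);

for my $pair ([octets => $octets], [chars => $chars], [flipped => $flipped]) {
    my ($name, $value) = @$pair;
    printf "%-8s length %d, UTF8 flag %s\n",
        $name, length($value), utf8::is_utf8($value) ? "on" : "off";
}
```

Both routes should end with a one-character string; the difference is that utf8::decode() checks the octets first, while Encode::_utf8_on() trusts them blindly.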
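When I execute my code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with roughly the same frequency. When I change the string I write into the "file", for example by adding a newline at its end, Undecoded disappears. When utf8::decode succeeds and I retry it on the same original string in a loop, it keeps succeeding within the same instance of the interpreter; if it fails, however, it keeps failing.
What is the explanation for the observed behavior? How can I use a file handle to a scalar variable together with character strings?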
Bash playground:
for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l
For reference, and to be absolutely sure, I also made a version with pedantic error handling. The results are the same.
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
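On closer inspection, the outcome has nothing to do with the contents or internals of $c, and the return value of decode accurately reflects what it did or did not do.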
$ for i in {1..2}; do
    perl -MDevel::Peek -we'
      use strict; use utf8;
      binmode STDOUT, ":utf8";
      binmode STDERR, ":utf8";
      my $c = "";
      open C, ">:utf8", \$c;
      print C "š";
      close C;
      die "Does not happen\n" if utf8::is_utf8($c);
      Dump($c);
      print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
      Dump($c)
    '
    echo
  done

SV = PV(0x17c8470) at 0x17de990
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x17d7a40 "\305\241"
  CUR = 2
  LEN = 16
Decoded
SV = PV(0x17c8470) at 0x17de990
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
  CUR = 2
  LEN = 16

SV = PV(0x2d0fee0) at 0x2d26400
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2d1f4b0 "\305\241"
  CUR = 2
  LEN = 16
Undecoded
SV = PV(0x2d0fee0) at 0x2d26400
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2d1f4b0 "\305\241"
  CUR = 2
  LEN = 16

This is a bug in utf8::decode. It was fixed in 5.16.3 or earlier, probably in 5.16.0, since it is still present in 5.14.2.
A suitable workaround is to use Encode's decode_utf8 instead.
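A minimal sketch of that workaround, applied to the in-memory file from the question, might look like the following. The helper name capture_utf8 is hypothetical, and a lexical file handle replaces the bareword C purely as a matter of style.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: write a character string through a :utf8 layer into
# an in-memory file, then decode the resulting octets back into characters.
sub capture_utf8 {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "Cannot open in-memory file: $!\n";
    print {$fh} $text               or die "Cannot print: $!\n";
    close $fh                       or die "Cannot close: $!\n";
    # $buffer now holds UTF-8 octets with the UTF8 flag off;
    # decode_utf8() returns a new character string and leaves $buffer alone.
    return decode_utf8($buffer);
}

my $text = capture_utf8("\x{161}");   # U+0161 is the character š
printf "%d character(s), first is U+%04X\n", length($text), ord($text);
```

Because decode_utf8 returns a new string rather than modifying its argument in place, the octets in the buffer stay available if they are needed afterwards.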