Hel*_*man 7 unicode perl utf-8 tie
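I'm working on a project that processes data in foreign languages. My Perl script was working fine.
Then I wanted to use Tie::File, since it's a neat concept (and saves time and coding).
It looks like Tie::File fails under Unicode/UTF-8 (unless I'm missing something).
Here is a program that demonstrates the problem (the data is a mix of English, Greek and Hebrew):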
use strict;
use warnings;
use 5.014;
use Win32::Console;
use autodie;
use warnings qw< FATAL utf8 >;
use Carp;
use Carp::Always;
use utf8;
use feature        qw< unicode_strings>;
use charnames      qw< :full>;
use Tie::File;
my ($i);
my ( $FileName);
my (@Tied);
binmode STDOUT, ':unix:utf8';
binmode STDERR, ':unix:utf8';
binmode $DB::OUT, ':unix:utf8' if $DB::OUT; # for the debugger
Win32::Console::OutputCP(65001);         # Set the console code page to UTF8
$FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\Work\\'.
        'Tie File test res.txt';
tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
            or confess 'tie @Tied failed';
$i =0;
while (<DATA>) {
    chomp;
    $Tied[$i] = $_;
    ++$i;
} # end while (<DATA>) 
$i =0;
foreach (@Tied) {
    say "$i $Tied[$i]";
    ++$i;
} # end foreach (@Tied)
untie @Tied;
__DATA__
?? ??????;
????? ?? ? ?????? ??
???? ?????
abc ?? ???? efg
??? ???? This is it
?????? ?????? 
?????? ????? ?????
???? ?? ???
?? ??????;
???? ??' 5
This produces a lot of warnings; here are some of them:
utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimper
l/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/p
erl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '?? ??????;') called at tie file test
.pl line 31
utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimper
l/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/p
erl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '?? ??????;') called at tie file test
.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimper
l/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/p
erl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '?? ??????;') called at tie file test
.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimper
l/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/p
erl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '?? ??????;') called at tie file test
.pl line 31
Then it prints this on STDOUT:
0 ?? ??????;
1 ????? ?? ? ?????? ??
2 ???? ?????
3 abc ?? ???? efg
4 ??? ???? This is it
5 ?????? ??????
6 ?????? ????? ?????
7 ???? ?? ???
8 ?? ??????;
9 ???? ??' 5
10
11
12
13
14 \xA4\x????\xA8\x
15
16
17
18
19
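Note that the first 10 lines are fine, but where do lines 10 to 19 come from!? In addition, the tied file itself ends up containing corrupted data: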
 ?? ????N???????? ?? ????bc ??????e?????? This is ???? ??????????? ??????????????;??? ??'
\xA4\x????\xA8\x
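Something is wrong here. Am I missing something, or can Tie::File not cope with Unicode/UTF-8? I'm running Strawberry Perl 5.14 on a Windows 7 system.
Many TIA, Helen
What I'd suggest depends very much on the actual problem you're trying to solve. Looking at this problem in isolation, I wouldn't use much encoding/decoding "magic" and would just work with raw bytes (since the script doesn't need to know anything about the characters themselves for this task). Given the input and output you described, the following produces the expected result.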
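use v5.014;
use warnings;
use autodie;
use Carp::Always;
use Tie::File;
my $file_in = 'test_in.txt';
my $file_out = 'test_tie.txt';
# Start from an empty output file on every run.
unlink $file_out;
# No "discipline" option: Tie::File reads and writes plain bytes.
tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A" or die 'tie failed';
# No encoding layer on the input either, so each line stays undecoded UTF-8 bytes.
open my $fh, '<', $file_in;
while (my $line = <$fh>) {
    chomp $line;
    push @tied, $line;
}
close $fh;
# Print the records back, prefixed with their index.
my $i = 0;
say $i++ . ' ' . $_ foreach @tied;
untie @tied;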
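However, you probably do want to do some processing of the text in between. In that case you will need to decode the characters. As far as I can see, there are two options:
1. Encode/decode the text manually where necessary
2. Get Tie::File to do the encoding/decoding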
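Point 2 may not be simple: from a quick scan of the Tie::File source, it looks like it assumes it is always given bytes. The only part you seem able to influence is the binmode at https://metacpan.org/source/TODDR/Tie-File-0.98/lib/Tie/File.pm#L111, which is what you are already doing.
Tie::File makes a lot of seek calls, and perldoc has this to say about seek (http://perldoc.perl.org/functions/seek.html):
"Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow)."
So it seems likely that Tie::File is using character lengths to work out the byte offsets of its records, and is therefore likely to land in the middle of a UTF-8 byte sequence. That seems the probable cause of your errors.
In general, I'd steer clear of relying on external modules to binmode filehandles for reading and writing. In this situation I'd have a simple sub call Encode::encode('UTF-8', ...) on the data before pushing it to @tied.
The exception would be if the module's documentation clearly states how it handles decoded data, or if the source is simple enough for me to verify that behaviour.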
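To make the character/byte mismatch concrete, here is a minimal sketch (not taken from Tie::File itself) that compares the length of a decoded string with the length of its UTF-8 encoding, and asks tell() for the position after writing through an :encoding(UTF-8) layer. The variable names and the in-memory filehandle are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use utf8;                       # the string literal below is written in UTF-8
use Encode qw(encode);

my $text  = "αβγ";              # three Greek letters, decoded characters
my $bytes = encode('UTF-8', $text);

my $char_len = length $text;    # length counted in characters
my $byte_len = length $bytes;   # length counted in bytes

# Write the characters through an encoding layer into an in-memory file,
# then ask tell() where we are. Per perldoc, this is a byte offset.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;
my $offset = tell $out;
close $out;

print "characters: $char_len, bytes: $byte_len, tell(): $offset\n";
```

Any code that mixes the first number with seek() or tell() positions will drift by one byte for every two-byte character, which would be consistent with the "does not map to Unicode" warnings in the question.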
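As a rough sketch of the manual approach, the code below keeps Tie::File working on bytes and does the encoding and decoding at the edges. The helper names store_line and fetch_line and the temporary file are hypothetical, introduced only for this example; they are not part of Tie::File.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical scratch file so the sketch is self-contained.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# No "discipline" option: Tie::File only ever sees bytes.
tie my @tied, 'Tie::File', $file, recsep => "\x0D\x0A"
    or die "tie failed: $!";

# Hypothetical helper: encode characters to UTF-8 bytes before storing.
sub store_line {
    my ($records, $text) = @_;
    push @{$records}, encode('UTF-8', $text);
    return;
}

# Hypothetical helper: decode the stored bytes back into characters.
sub fetch_line {
    my ($records, $index) = @_;
    return decode('UTF-8', $records->[$index]);
}

# Greek and Hebrew test data, written as escapes to keep the source ASCII.
store_line(\@tied, "abc \x{3b1}\x{3b2}\x{3b3} efg");
store_line(\@tied, "\x{5e9}\x{5dc}\x{5d5}\x{5dd} This is it");

my $record_count = scalar @tied;
my @lines = map { fetch_line(\@tied, $_) } 0 .. $#tied;

untie @tied;
```

With this arrangement, any text processing happens on the decoded strings in @lines, while the record offsets that Tie::File tracks internally stay in bytes throughout.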