Sek*_*eki 4 unicode perl byte-order-mark newline utf-16
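I need to generate some UTF-16LE encoded files with CRLF line separators on a Windows 7 machine (currently with Strawberry Perl 5.20.1).
It took me a long time to get correct output, and I wonder whether my solution is the right way to do it, because it seems overly complicated in Perl compared with other languages. In particular:
- Why does Perl produce valid big-endian UTF-16 with a correct BOM for encoding(UTF-16), whereas with UTF-16LE or UTF-16BE there is no BOM unless I use the additional package File::BOM?
- Why does the out-of-the-box CRLF handling seem buggy (it outputs 0D 0A 00 instead of 0D 00 0A 00) unless I tinker with filters? I suspect this cannot be a real bug in a language with so many users...
Here are my commented attempts; the one I found to be correct is the last statement.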
use strict;
use warnings;
use utf8;
use File::BOM;
use feature 'say';
my $UTF;
my $data = "Hello, héhé, 中文.\nsecond line : my 2€"; # 中文 = zhong wen = chinese
# UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
say $UTF $data;
close $UTF;
# same as UTF-16BE (no BOM, incorrect CRLF)
open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 BE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, BOM OK but still incorrect CRLF
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE non raw incorrect
# (crlf by default on windows) -> 0A => 0D 0A
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf2.txt" or die $!;
print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
close $UTF;
# UTF16 LE + BOM + LF
# raw -> 0A => 0A
# could be correct on UNIX but I need CRLF
open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
say $UTF $data;
close $UTF;
# manual BOM, but CRLF OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
print $UTF "\x{FEFF}";
say $UTF $data;
close $UTF;
#auto BOM, CRLF OK ?
#incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
# But I cannot see where the A9 comes from ??!
#~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
#~ print $UTF $data;
#~ say $UTF $data;
#~ close $UTF;
# WTF? \n becomes 0D 00 0D 0A 00
open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
say $UTF $data;
close $UTF;
#CORRECT WAY?? : Automatic BOM, CRLF is OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
say $UTF $data;
close $UTF;
> manual BOM, but CRLF OK
Yes, the following is indeed correct:
:raw:encoding(UTF-16LE):crlf + manual BOM
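A minimal sketch of that recipe, assuming a hypothetical helper named write_utf16le_crlf and a scratch file from File::Temp (neither comes from the question). It writes the BOM by hand and then dumps the resulting bytes so the line ending can be checked.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with a BOM and CRLF line endings.
sub write_utf16le_crlf {
    my ($path, $text) = @_;
    open my $fh, '>:raw:encoding(UTF-16LE):crlf', $path
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";    # manual BOM, encoded by the layer as FF FE
    print {$fh} $text;         # each "\n" leaves the handle as 0D 00 0A 00
    close $fh or die "Cannot close $path: $!";
    return;
}

# Read a file back as raw bytes and render them as space-separated hex.
sub hex_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                  # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
write_utf16le_crlf($path, "h\x{E9}\n");
print hex_bytes($path), "\n";
```

The dump should start with FF FE (the BOM) and end with 0D 00 0A 00 (CRLF as two UTF-16LE code units).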
:raw "clears" the existing :crlf and :encoding layers.
:encoding converts between bytes and code points.
:crlf converts between CRLF and LF.
So,
                               Read
=================================================================>
                             Code                 Code
+------+  bytes  +------+   Points   +-------+   Points   +------+
| File |---------| :enc |------------| :crlf |------------| Code |
+------+         +------+    CRLF    +-------+     LF     +------+
<=================================================================
                              Write
You want the CRLF⇔LF conversion to be performed on code points (not bytes), as this setup does.
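The effect of the layer order can be seen by dumping the bytes each stack produces. This is only a sketch: bytes_written is a hypothetical helper, and the :crlf:encoding(UTF-16LE) stack is spelled out explicitly to imitate, on any platform, what happens when :encoding is appended above the default :crlf layer on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layers, return the bytes as hex.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, ">$layers", $path or die "Cannot open $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                  # slurp mode
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# :crlf above :encoding: "\n" becomes the code points CR LF, then gets encoded.
my $on_code_points = bytes_written(':raw:encoding(UTF-16LE):crlf', "A\n");

# :encoding above :crlf: the text is encoded first, then the lone 0A byte is
# expanded to 0D 0A, which corrupts the UTF-16 stream.
my $on_bytes = bytes_written(':raw:crlf:encoding(UTF-16LE)', "A\n");

print "on code points: $on_code_points\n";
print "on bytes:       $on_bytes\n";
```

The first stack yields 41 00 0D 00 0A 00; the second yields 41 00 0D 0A 00, the same broken sequence reported in the question.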
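> CORRECT WAY?? : Automatic BOM, CRLF is OK
While :raw:encoding(UTF-16LE):crlf:via(File::BOM) might work for write handles, it doesn't look right (I would have expected :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for read handles (at least for me, with Perl 5.16.3).
I just had a look, and the code behind :via(File::BOM) does some very questionable things. I wouldn't use it.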
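For read handles, one possible counterpart to the manual-BOM approach is to use the same layer stack and strip a leading U+FEFF by hand. This is a sketch only: read_utf16le_crlf is a hypothetical helper, the sample file is built here purely for illustration, and the input is assumed to be little-endian.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16LE/CRLF file into chomped lines,
# dropping a leading BOM (U+FEFF) if there is one.
sub read_utf16le_crlf {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UTF-16LE):crlf', $path
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;   # UTF-16LE decodes the BOM as a character
    chomp @lines;
    return @lines;
}

# Build a sample file byte for byte: BOM (FF FE), then UTF-16LE text with CRLF.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xFE", encode('UTF-16LE', "first\r\nsecond\r\n");
close $tmp or die "Cannot close $path: $!";

my @lines = read_utf16le_crlf($path);
print scalar(@lines), " lines: @lines\n";
```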
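> Why does Perl produce valid big-endian UTF-16 with a correct BOM for encoding(UTF-16), whereas with UTF-16LE or UTF-16BE there is no BOM unless I use the additional package File::BOM?
Because you might not want a BOM.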
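The same distinction shows up when calling Encode directly, without any file handle. The sketch below (hex_of is a hypothetical helper) encodes one character three ways: UTF-16 prepends a big-endian BOM, UTF-16LE adds nothing, and a U+FEFF written by hand is encoded like any other character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_of {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $utf16     = hex_of(encode('UTF-16',   'A'));           # BOM + big-endian
my $le_plain  = hex_of(encode('UTF-16LE', 'A'));           # no BOM
my $le_manual = hex_of(encode('UTF-16LE', "\x{FEFF}A"));   # BOM supplied by hand

print "UTF-16:            $utf16\n";
print "UTF-16LE:          $le_plain\n";
print "UTF-16LE + U+FEFF: $le_manual\n";
```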
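> Why does the out-of-the-box CRLF handling seem buggy?
Adding layers appends them to the end of the list. If you want to add a layer elsewhere (as is the case here), you need to rebuild the list.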
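The resulting stacks can be inspected with the built-in PerlIO::get_layers, which lists layers from the bottom up. In this sketch, layer_stack and last_index are hypothetical helpers, and the :crlf:encoding(UTF-16LE) string again stands in for "encoding appended above an existing :crlf". The exact layer names printed may differ between platforms.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a scratch file with the given layers and
# return the resulting layer stack, bottom layer first.
sub layer_stack {
    my ($layers) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $fh, ">$layers", $path or die "Cannot open $path: $!";
    my @stack = PerlIO::get_layers($fh);
    close $fh;
    return @stack;
}

# Hypothetical helper: index of the last layer matching $pattern, or -1.
sub last_index {
    my ($pattern, @stack) = @_;
    my ($found) = grep { $stack[$_] =~ $pattern } reverse 0 .. $#stack;
    return defined $found ? $found : -1;
}

my @appended = layer_stack(':crlf:encoding(UTF-16LE)');       # encoding on top
my @rebuilt  = layer_stack(':raw:encoding(UTF-16LE):crlf');   # crlf on top

print "appended: @appended\n";
print "rebuilt:  @rebuilt\n";
```

In the rebuilt stack the active :crlf layer sits above the encoding layer, so it works on code points; in the appended stack it sits below and works on encoded bytes.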
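It has been suggested on Perl's development list that there should be a way to distinguish byte layers (e.g. :unix) from text layers (e.g. :crlf), and that adding a byte or encoding layer should dig down and place it in the appropriate spot. But no one has acted on this yet.
In addition to simplifying your code, this would allow a UTF-16*[1] encoding layer to be added to STDIN/STDOUT/STDERR (or other existing handles). I believe this is currently impossible.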