Converting hex to UTF8 in Perl does not work as expected

Dan*_*ley 2 perl utf-8

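I am trying to understand UTF8 in Perl.

I have the following string: Alizéh. If I look up the hex for this string, I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).

So I thought that packing that hex and encoding it as utf8 should produce the Unicode string. But it produces something very different.

Is anyone able to explain what I am getting wrong?

Here is a simple test program to show my working.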

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

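Any tips on how to take the hex value of a UTF8 string and turn it into a valid UTF8 scalar in Perl?

There is some further weirdness, which I will explain with this extended version: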

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

print "=========================================== Unaccent test start\n";

my $plaintest = unac_string('utf8', "Alizéh");

print "Alizéh passed to the unaccent gives $plaintest\n";


my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as  $cleanpackedHexIntoPlainString\n";

my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Unaccenting the packed version gives $packedtest\n";

utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";

$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Now unaccenting the packed version gives $packedtest\n";

print "=========================================== Unaccent test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as  Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish

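In this test, the unaccent library seems to accept the packed version of the string's hex. I don't know why; can anyone help me understand why that is the case?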

Gri*_*nnz 5

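Unicode strings are first-class values in Perl; you don't need to jump through these hoops. You just need to recognize and keep track of when you have bytes and when you have characters. Perl will not differentiate for you, because every byte string is also a valid character string. In effect, you are double encoding: the packed string already holds UTF-8 encoded bytes, and utf8::encode treats each of those bytes as a character and encodes it again.

use utf8; decodes your source code from UTF-8, so once it is declared, the literal strings that follow are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same thing from a string of UTF-8 bytes (such as the one you produce by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).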

use strict;
use warnings;
use utf8;
use Encode 'decode';

my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
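To make the double encoding concrete, here is a minimal sketch that compares the packed bytes, the decoded characters, and the result of calling utf8::encode on the bytes as in the question. The variable name `$double` is introduced here for illustration only; everything else uses core Perl and Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $hex   = '416c697ac3a968';
my $bytes = pack 'H*', $hex;            # 7 UTF-8 encoded bytes

# The mistake from the question: encoding a string that already holds bytes.
# Each byte >= 0x80 (c3 and a9) is treated as a character and becomes two bytes.
my $double = $bytes;                    # hypothetical name, for illustration
utf8::encode($double);

# The correct direction: bytes -> characters.
my $chars = decode('UTF-8', $bytes);    # 6 characters, the fifth is U+00E9

printf "bytes:  %d bytes, hex %s\n", length $bytes, unpack('H*', $bytes);
printf "double: %d bytes, hex %s\n", length $double, unpack('H*', $double);
printf "chars:  %d characters: %s\n", length $chars,
    join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
```

The packed string is 7 bytes long, the double-encoded one is 9 bytes (416c697ac383c2a968), and the decoded string is 6 characters long.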

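Unicode strings need to be encoded to UTF-8 for output to anything that expects bytes, such as STDOUT. An :encoding(UTF-8) layer can be applied to such a handle to do this automatically, and likewise to decode automatically from an input handle. Exactly what should be applied depends entirely on where your characters are coming from and where they are going. For too much information on the available options, see this answer.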

use Encode 'encode';
print encode 'UTF-8', "$chars\n";    # option 1: encode explicitly before printing
binmode *STDOUT, ':encoding(UTF-8)'; # option 2: let the handle encode; warning: global effect
print "$chars\n";
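The same layer mechanism works on any handle, not just STDOUT. As a rough sketch, the following writes characters through an :encoding(UTF-8) layer into an in-memory buffer and reads them back through the same kind of layer. The in-memory handle and the names `$buffer` and `$roundtrip` are assumptions chosen to keep the example self-contained; `"Aliz\x{e9}h"` stands in for the literal 'Alizéh' under use utf8.

```perl
use strict;
use warnings;

# Same characters as the literal 'Alizéh' under `use utf8` (6 characters).
my $chars = "Aliz\x{e9}h";

# Output: the layer encodes characters to UTF-8 bytes as they are written.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write failed: $!";
print {$out} $chars;
close $out or die "close failed: $!";

# Input: the layer decodes UTF-8 bytes back to characters as they are read.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read failed: $!";
my $roundtrip = <$in>;
close $in;

printf "buffer holds %d bytes, hex %s\n", length $buffer, unpack('H*', $buffer);
printf "read back %d characters\n", length $roundtrip;
```

The buffer ends up holding the 7 bytes 416c697ac3a968, and reading it back yields the original 6 characters without any explicit call to encode or decode.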

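  • Perfect. I would just add that `binmode *STDOUT, ':encoding(UTF-8)'` is one of the things done by `use open ':std', ':encoding(UTF-8)';` (2 upvotes)
  • Re "*`my $str = 'Alizéh'; # already decoded`*": it is decoded because of `use utf8;`, in case that is not clear. (2 upvotes)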