使用Encode :: encode with"utf8"

Ste*_*hen 7 perl

你可能知道,在Perl中,"utf8"意味着Perl对UTF-8的理解更为宽松,它允许在技术上不是UTF-8中有效代码点的字符.相比之下,"UTF-8"(或"utf-8")是Perl对UTF-8更严格的理解,它不允许无效的代码点.

我有一些与此区别相关的使用问题:

  1. 默认情况下,Encode :: encode将使用替换字符替换无效字符.即使您将更宽松的"utf8"作为编码传递,这是真的吗?

  2. 当您读写open"使用"UTF-8" 文件时会发生什么?角色替换是否发生在糟糕的角色上或是否发生了其他事情?

  3. 使用open像'>:utf8'这样的图层和像'>:encoding(utf8)'这样的图层有什么区别?这两种方法都可以同时用于'utf8'和'UTF-8'吗?

ike*_*ami 11

                   ?????????????????????????????????????????????????????????????????????
                   ?                                            ?                      ?
                   ?                  On Read                   ?       On Write       ?
                   ?                                            ?                      ?
        Perl       ?????????????????????????????????????????????????????????????????????
        5.26       ?                     ?                      ?                      ?
                   ? Invalid encoding    ? Outside of Unicode,  ? Outside of Unicode,  ?
                   ? other than sequence ? Unicode nonchar, or  ? Unicode nonchar, or  ?
                   ? length              ? Unicode surrogate    ? Unicode surrogate    ?
                   ?                     ?                      ?                      ?
????????????????????????????????????????????????????????????????????????????????????????
?                  ?                     ?                      ?                      ?
? :encoding(UTF-8) ? Warns and Replaces  ? Warns and Replaces   ? Warns and Replaces   ?
?                  ?                     ?                      ?                      ?
????????????????????????????????????????????????????????????????????????????????????????
?                  ?                     ?                      ?                      ?
? :encoding(utf8)  ? Warns and Replaces  ? Accepts              ? Warns and Accepts    ?
?                  ?                     ?                      ?                      ?
????????????????????????????????????????????????????????????????????????????????????????
?                  ?                     ?                      ?                      ?
? :utf8            ? Corrupt scalar      ? Accepts              ? Warns and Accepts    ?
?                  ?                     ?                      ?                      ?
????????????????????????????????????????????????????????????????????????????????????????
Run Code Online (Sandbox Code Playgroud)

如果您无法查看上述内容,请单击此处

请注意,:encoding(UTF-8)实际上使用utf8解码,然后检查范围是否允许字符(因为它识别"\x{20_000}"甚至"\x{1000_0000_0000_0000}").这减少了错误消息的数量,因此它很好.

(编码名称不区分大小写.)


测试:

在读

  • :encoding(UTF-8)

    printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(UTF-8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\xFFFF" does not map to Unicode.
    utf8 "\xD800" does not map to Unicode.
    utf8 "\x200000" does not map to Unicode.
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
    5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
    5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
    Run Code Online (Sandbox Code Playgroud)
  • :encoding(utf8)

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    utf8 "\x80" does not map to Unicode.
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
    Run Code Online (Sandbox Code Playgroud)
  • :utf8

    $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
       perl -MB -nle'
          use open ":std", ":encoding(utf8)";
          my $sv = B::svref_2object(\$_);
          printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
       '
    E9 (internal: C3.A9, UTF8=1)
    FFFF (internal: EF.BF.BF, UTF8=1)
    D800 (internal: ED.A0.80, UTF8=1)
    200000 (internal: F8.88.80.80.80, UTF8=1)
    Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
    0 (internal: 80, UTF8=1)
    
    Run Code Online (Sandbox Code Playgroud)

在写

  • :encoding(UTF-8)

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
    "\x{ffff}" does not map to utf8.
    "\x{d800}" does not map to utf8.
    "\x{200000}" does not map to utf8.
    
    $ od -t c a
    0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
    0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
    0000040
    
    $ cat a
    é
    \x{FFFF}
    \x{D800}
    \x{200000}
    
    Run Code Online (Sandbox Code Playgroud)
  • :encoding(utf8)

    $ perl -e'
       use open ":std", ":encoding(utf8)";
       print "\x{E9}\n";
       print "\x{FFFF}\n";
       print "\x{D800}\n";
       print "\x{20_0000}\n";
    ' >a
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
    Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
    
    $ od -t c a
    0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
    0000015
    
    $ cat a
    é
    ?
    ?
    
    Run Code Online (Sandbox Code Playgroud)
  • :utf8

    结果相同:encoding(utf8).

使用Perl 5.26测试.


默认情况下,Encode :: encode将使用替换字符替换无效字符.即使您将更宽松的"utf8"作为编码传递,这是真的吗?

Perl字符串是32位或64位字符的字符串,具体取决于构建.utf8可以编码任何72位整数.因此,它能够编码可以被要求编码的所有字符.