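How do I determine in Perl whether a Unicode character is full-width (occupies two cells; double width) or half-width (like ordinary Latin characters)?
For example, emoji are double width, but there are also double-width characters in the lower blocks, such as "\N{MEDIUM BLACK CIRCLE}" (U+26ab).
I tried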
Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()
but that returns 1 as well.
I had some C++ code lying around for computing character widths. So, one quick conversion to Perl later, and...

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;

# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
    state %cache;    # Memoize results per codepoint

    my $cp = shift;    # Numeric codepoint
    return $cache{$cp} if exists $cache{$cp};

    if ($cp == 0 || $cp == 0x200B) {
        # nul and ZERO WIDTH SPACE
        $cache{$cp} = 0;
        return 0;
    } elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
        # Hangul Jamo vowels and final consonants
        $cache{$cp} = 0;
        return 0;
    } elsif ($cp == 0xAD) {
        # SOFT HYPHEN
        $cache{$cp} = 1;
        return 1;
    }

    my $ci = charinfo($cp);
    return undef unless defined $ci;

    my $type = $ci->{category};
    if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
        # Control Code, Non Spacing Mark, Enclosing Mark, Format Char
        $cache{$cp} = 0;
        return 0;
    }

    # Everything else: map the East_Asian_Width property to a column count
    state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
                      Neutral => 1, Ambiguous => 1 };
    my $eaw = charprop($cp, "East_Asian_Width");
    my $width = $widths->{$eaw} // 1;
    $cache{$cp} = $width;
    return $width;
}

sub testwidth ($) {
    my $char = shift;
    my $cp = ord $char;
    printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
        charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

Example usage:

$ ./charwidths.pl
Width of (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of 🔴 (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of 😷 (U+1F637 FACE WITH MEDICAL MASK) is 2

It just does some special-case checks for particular ranges and categories of codepoints, and then uses the East Asian Width property, along with the recommendations of TR11, to determine the width of everything else.
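
If you want to see what the last step is working from, you can look at the East_Asian_Width property on its own. This is only a rough sketch: `eaw_columns` is a name made up for illustration, it skips all of the zero-width special cases above, and it assumes `charprop` hands back the long property value names (Wide, Fullwidth, and so on), as the width table in the script above also assumes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD qw/charprop/;

# Hypothetical helper: column count from East_Asian_Width alone.
# Deliberately ignores control characters, combining marks, etc.
sub eaw_columns {
    my ($cp) = @_;
    my $eaw = charprop($cp, "East_Asian_Width") // "Neutral";
    return ($eaw eq "Wide" || $eaw eq "Fullwidth") ? 2 : 1;
}

# a, WON SIGN, MEDIUM BLACK CIRCLE, a CJK ideograph, FULLWIDTH A, LARGE RED CIRCLE
for my $cp (0x61, 0x20A9, 0x26AB, 0x4E2D, 0xFF21, 0x1F534) {
    printf "U+%04X East_Asian_Width=%s columns=%d\n",
        $cp, charprop($cp, "East_Asian_Width") // "(unknown)", eaw_columns($cp);
}
```

The values you get depend on the Unicode version bundled with your Perl, so older installations may report a different property value for the emoji and symbol codepoints.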