How to determine whether a Unicode character is fullwidth or halfwidth in Perl

FER*_*csI 2 unicode perl

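In Perl, how can I determine whether a Unicode character is fullwidth (occupies two cells; double width) or halfwidth (like the usual Latin characters)?

For example, emoji are double width, but there are also double-width characters in the lower blocks, e.g. "\N{MEDIUM BLACK CIRCLE}" (U+26AB).

I tried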

Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()

but that returns 1 as well.

Sha*_*awn 5

I have some C++ code for computing character widths. So, one quick conversion to Perl later, and...

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;

# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
  state %cache;

  my $cp = shift; # Numeric codepoint
  return $cache{$cp} if exists $cache{$cp};

  if ($cp == 0 || $cp == 0x200B) {
    # nul and ZERO WIDTH SPACE
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
    # Hangul Jamo vowels and final consonants
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp == 0xAD) {
    # SOFT HYPHEN
    $cache{$cp} = 1;
    return 1;
  }

  my $ci = charinfo($cp);
  return undef unless defined $ci;

  my $type = $ci->{category};
  if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
    # Control Code, Non Spacing Mark, Enclosing Mark, Format Char
    $cache{$cp} = 0;
    return 0;
  }

  state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
                    Neutral => 1, Ambiguous => 1 };
  my $eaw = charprop($cp, "East_Asian_Width");
  my $width = $widths->{$eaw} // 1;
  $cache{$cp} = $width;
  return $width;
}

sub testwidth ($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
    charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

Example usage:

$ ./charwidths.pl
Width of  (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of 🔴 (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of 😷 (U+1F637 FACE WITH MEDICAL MASK) is 2

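It just does a few special-case checks on particular ranges and categories of code points, and then uses the East Asian Width property, along with the recommendations of TR11, to work out the width of everything else.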


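  • @ikegami I'm working on some code that can map and merge the invlists/invmaps that Unicode::UCD uses (I think I've got it working, but it's a bit messy and unwieldy for an SO post). The end result is a 909-element array that can be binary searched using search_invlist. (2 upvotes)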
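
For readers curious what an inversion-map lookup might look like, here is a minimal sketch. It is not the merged 909-element table mentioned in the comment. It only loads the East_Asian_Width inversion map with prop_invmap and binary searches it with search_invlist, both from Unicode::UCD. The names ea_width and %wide are made up for illustration. The sketch assumes the map values may be spelled either short (W, F) or long (Wide, Fullwidth), so it accepts both, and it ignores the zero-width special cases that the answer handles.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD qw/prop_invmap search_invlist/;

# Load the East_Asian_Width inversion map once.
# $ranges holds the sorted range starts, $values the value for each range.
my ($ranges, $values, $format, $default) = prop_invmap("East_Asian_Width");

# Assumption: values may be short or long names, so accept both spellings.
my %wide = map { $_ => 1 } qw/W Wide F Fullwidth/;

# Hypothetical helper: 2 columns for Wide/Fullwidth, otherwise 1.
# Zero-width characters (combining marks, controls) are NOT handled here.
sub ea_width {
  my $cp = shift;                          # numeric code point
  my $i  = search_invlist($ranges, $cp);   # binary search for the range
  my $value = defined $i ? $values->[$i] : $default;
  return $wide{$value} ? 2 : 1;
}

printf "U+%04X -> %d column(s)\n", $_, ea_width($_)
  for 0x61, 0x4E2D, 0xFF21, 0x20A9;
```

Because the table is loaded once and each lookup is a binary search over range starts, this avoids a charprop call per character, at the cost of covering only the East Asian Width part of the logic.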