egi*_*hri 4 unicode perl multilingual locale collation
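I am building a piece of software that sorts book indexes in different languages. It is written in Perl, and it keys off the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or am I barking up the wrong tree by relying on the locale? Bottom line: Windows is really where I need this to work, but I am more comfortable developing in a UNIX environment.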
tch*_*ist 11
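Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, it is easy to use the Unicode::Collate module as a starting point.
If you want locale tailoring, then you will probably want to start with Unicode::Collate::Locale instead.
If you are running in an all-UTF8 environment, this is easy. But if you are subject to the vicissitudes of random so-called "locales" (or, even worse, the ugly things Microsoft calls "code pages"), then you might want to get the CPAN Encode::Locale module to help you. For example: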
use Encode;
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_) } @ARGV;
# or as a stream for binmode or open
binmode $some_fh, ":encoding(locale)";
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
(If it were me, I would just use ":utf8" for the output.)
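To make the decode-on-input, encode-on-output discipline concrete, here is a minimal sketch using the core Encode module. It assumes, purely for illustration, that the incoming bytes are known to be cp1252; the sample bytes and variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the word "éclair" as a cp1252 file would store it.
my $bytes = "\xE9clair";

# Decode once, at the boundary, into Perl's internal character strings.
my $chars = decode("cp1252", $bytes);

# Inside the program, work only with characters: there are 6 here.
printf "%d characters\n", length $chars;

# Encode once more on the way out, here as UTF-8, which takes 7 bytes.
my $utf8 = encode("UTF-8", $chars);
printf "%d bytes\n", length $utf8;
```

Everything between the decode and the encode, including collation, then sees the same characters regardless of what the platform's locale or code page happens to be.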
The point is that once you have everything decoded into Perl's internal format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:
use v5.14;
use utf8;
use Unicode::Collate;
my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
@exes = Unicode::Collate->new->sort(@exes);
say "@exes";
# prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
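For comparison, here is a small sketch of how the default collation differs from Perl's built-in sort, which simply compares code points. The word list is made up for illustration.

```perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":encoding(UTF-8)";

my @words = qw( zebra Apple éclair apple Zulu );

# Built-in sort compares code points: capitals first, accented letters last.
my @by_codepoint = sort @words;

# Unicode::Collate compares base letters first, then accents, then case.
my @collated = Unicode::Collate->new->sort(@words);

say "code point order: @by_codepoint";   # Apple Zulu apple zebra éclair
say "collated order:   @collated";       # apple Apple éclair zebra Zulu
```

The second ordering is what a reader of a book index would expect, which is why the collator rather than the bare sort builtin is the right tool here.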
Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.
my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x;            # strip leading articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
        return $_;
    },
);
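As a usage sketch, the title collator can be exercised as below. The constructor is repeated so that the snippet runs on its own, and the titles are made-up sample data.

```perl
use v5.14;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x;            # strip leading articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
        return $_;
    },
);

# Hypothetical sample titles.
my @titles = (
    "The Hobbit",
    "Catch 22",
    "A Tale of Two Cities",
    "Catch 9",
    "An Ideal Husband",
);

# preprocess only affects the sort keys; the original strings come back.
my @sorted = $collator->sort(@titles);
say for @sorted;
```

With these titles, "Catch 9" lands before "Catch 22" because of the zero-padding, and the remaining titles are filed under H, I and T rather than under their articles.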
Now just use that object's sort method to sort with.
Sometimes you need to turn the sort inside out. For example:
my $collator = Unicode::Collate->new();
for my $rec (@recs) {
    $rec->{NAME_key} =
        $collator->getSortKey( $rec->{NAME} );
}
@srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;
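A self-contained version of the same idea might look like the sketch below. The records are invented sample data; only getSortKey and the ordinary sort builtin are used.

```perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical records with two fields each.
my @recs = (
    { NAME => "Zoë",   AGE => 30 },
    { NAME => "eve",   AGE => 30 },
    { NAME => "Adam",  AGE => 41 },
    { NAME => "Émile", AGE => 30 },
);

# Compute each binary sort key once, up front.
my $collator = Unicode::Collate->new;
$_->{NAME_key} = $collator->getSortKey( $_->{NAME} ) for @recs;

# Oldest first; ties broken by collated name via plain cmp on the keys.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

say "$_->{AGE} $_->{NAME}" for @srecs;
```

Precomputing the keys also means the collation work is done once per record instead of once per comparison.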
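The reason you have to do that is that you are sorting records that have several fields. The binary sort key lets you use the cmp operator on data that has been through your chosen or customized collator object.
The full constructor for the collator object has all of this as its formal syntax: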
$Collator = Unicode::Collate->new(
    UCA_Version => $UCA_Version,
    alternate => $alternate, # alias for 'variable'
    backwards => $levelNumber, # or \@levelNumbers
    entry => $element,
    hangul_terminator => $term_primary_weight,
    highestFFFF => $bool,
    identical => $bool,
    ignoreName => qr/$ignoreName/,
    ignoreChar => qr/$ignoreChar/,
    ignore_level2 => $bool,
    katakana_before_hiragana => $bool,
    level => $collationLevel,
    minimalFFFE => $bool,
    normalization => $normalization_form,
    overrideCJK => \&overrideCJK,
    overrideHangul => \&overrideHangul,
    preprocess => \&preprocess,
    rearrange => \@charList,
    rewrite => \&rewrite,
    suppress => \@charList,
    table => $filename,
    undefName => qr/$undefName/,
    undefChar => qr/$undefChar/,
    upper_before_lower => $bool,
    variable => $variable,
);
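But you usually do not have to worry about almost any of these. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.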
use Unicode::Collate::Locale;
$coll = Unicode::Collate::Locale->
new(locale => "fr");
@french_text = $coll->sort(@french_text);
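To see what a tailoring actually changes, here is a small sketch comparing the default ordering with the Swedish ("sv") tailoring from the locale list. The word list is made up.

```perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ":encoding(UTF-8)";

my @words = qw( zebra öl apple );

# Default rules: ö is just an accented o.
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: ö is a letter of its own, filed after z.
my @swedish = Unicode::Collate::Locale->new( locale => "sv" )->sort(@words);

say "default: @default";   # apple öl zebra
say "sv:      @swedish";   # apple zebra öl
```

The same data yields different, equally correct indexes depending on the language of the book, which is why the locale has to be an explicit parameter rather than something inherited from the environment.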
See how easy that is?
But you can do other really cool things, too.
use v5.14;   # for say
use utf8;    # the source contains non-ASCII literals
use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new(
    locale        => "de__phonebook",
    level         => 1,
    normalization => undef,
);
my $full = "Ich müß Perl studieren.";
my $sub  = "MUESS";
if (my ($pos, $len) = $Collator->index($full, $sub)) {
    my $match = substr($full, $pos, $len);
    say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
}
When run, that says:
Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
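The level => 1 setting is what makes that match so forgiving: at the primary level only base letters count, so accents and case are ignored. A minimal sketch of the same effect with the eq method on a plain collator (the sample strings are made up):

```perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate;

# Primary strength only: accents and case do not distinguish strings.
my $loose = Unicode::Collate->new( level => 1 );

# Default strength: accents and case do count.
my $exact = Unicode::Collate->new;

say $loose->eq( "résumé", "RESUME" ) ? "loose: equal" : "loose: different";
say $exact->eq( "résumé", "RESUME" ) ? "exact: equal" : "exact: different";
```

This prints "loose: equal" followed by "exact: different".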
Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:
locale name       description
--------------------------------------------------------------
af                Afrikaans
ar                Arabic
as                Assamese
az                Azerbaijani (Azeri)
be                Belarusian
bg                Bulgarian
bn                Bengali
bs                Bosnian
bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
ca                Catalan
cs                Czech
cy                Welsh
da                Danish
de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
ee                Ewe
eo                Esperanto
es                Spanish
es__traditional   Spanish ('ch' and 'll' as a grapheme)
et                Estonian
fa                Persian
fi                Finnish (v and w are primary equal)
fi__phonebook     Finnish (v and w as separate characters)
fil               Filipino
fo                Faroese
fr                French
gu                Gujarati
ha                Hausa
haw               Hawaiian
hi                Hindi
hr                Croatian
hu                Hungarian
hy                Armenian
ig                Igbo
is                Icelandic
ja                Japanese [1]
kk                Kazakh
kl                Kalaallisut
kn                Kannada
ko                Korean [2]
kok               Konkani
ln                Lingala
lt                Lithuanian
lv                Latvian
mk                Macedonian
ml                Malayalam
mr                Marathi
mt                Maltese
nb                Norwegian Bokmal
nn                Norwegian Nynorsk
nso               Northern Sotho
om                Oromo
or                Oriya
pa                Punjabi
pl                Polish
ro                Romanian
ru                Russian
sa                Sanskrit
se                Northern Sami
si                Sinhala
si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk                Slovak
sl                Slovenian
sq                Albanian
sr                Serbian
sr_Latn           Serbian in Latin (tailored as Croatian)
sv                Swedish (v and w are primary equal)
sv__reformed      Swedish (v and w as separate characters)
ta                Tamil
te                Telugu
th                Thai
tn                Tswana
to                Tonga
tr                Turkish
uk                Ukrainian
ur                Urdu
vi                Vietnamese
wae               Walser
wo                Wolof
yo                Yoruba
zh                Chinese
zh__big5han       Chinese (ideographs: big5 order)
zh__gb2312han     Chinese (ideographs: GB-2312 order)
zh__pinyin        Chinese (ideographs: pinyin order) [3]
zh__stroke        Chinese (ideographs: stroke order) [3]
zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).
Note
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.
[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.
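In summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that does not rely on the random settings of the user's console window for correct behavior.
Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by its author's kind permission. :)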