nor*_*yno 23 regex unicode perl collation
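For example, can "Nation" be matched inside an accented spelling of "Internationalization" without extra modules? Is that possible in newer Perl versions (5.14, 5.15, etc.)?
I found the answer! Thanks to tchrist.
The right solution, matching with UCA (thanks to /sf/users/32989071/):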

    # Find start/end offsets of every non-overlapping occurrence of the
    # matched substring.
    use 5.014;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";
    # level => 1 compares base letters only (accents and case are ignored);
    # match() requires normalization to be switched off
    my $Collator = Unicode::Collate->new(
        normalization => undef, level => 1
    );
    my @match = $Collator->match($str, $look);
    if (@match) {
        my $found = $match[0];
        my $f_len = length($found);
        say "match result: $found (length is $f_len)";
        my $offset = 0;
        # locate each literal occurrence of the matched text
        while ((my $start = index($str, $found, $offset)) != -1) {
            my $end = $start + $f_len;    # exclusive end offset
            say sprintf("found at: %s,%s", $start, $end);
            $offset = $end;               # resume right after this match
        }
    }
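The offsets can also be requested from the collator itself. The following is only a sketch, assuming the `index` method of Unicode::Collate, which in list context returns the position and length of a match (and an empty list when nothing matches). The helper name `uca_offsets` is hypothetical and not part of the original post.

```perl
use 5.014;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not from the original post): returns a list of
# [start, length] pairs, one per non-overlapping match of $look in $str.
sub uca_offsets {
    my ($str, $look) = @_;
    # index() croaks unless normalization is disabled
    my $collator = Unicode::Collate->new(normalization => undef, level => 1);
    my @hits;
    my $pos = 0;
    # in list context index() returns (position, length), or an empty list
    while (my ($start, $len) = $collator->index($str, $look, $pos)) {
        push @hits, [$start, $len];
        $pos = $start + ($len || 1);    # continue after this match
    }
    return @hits;
}

my $str = "Îñţérñåţîöñåļîžåţîöñ" x 2;
for my $hit (uca_offsets($str, "Nation")) {
    my ($start, $len) = @$hit;
    say "found at: $start,", $start + $len,
        " (", substr($str, $start, $len), ")";
}
```

Unlike searching for the first matched text with the built-in `index`, this asks the collator for every occurrence, so it should also find occurrences that are spelled with different accents.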
A wrong (but working) solution, from http://www.perlmonks.org/?node_id=485681.
The magic code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;
Code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;
    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);    # decompose into base letters plus combining marks
    # \pM is the short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g;    # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
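One limitation of the stripping approach can be shown with a small sketch: letters that have no canonical decomposition, such as "Ł", "Ø" or "ß", contain no separate mark for `\pM` to remove, so they pass through unchanged. The helper name `strip_marks` is hypothetical; it only wraps the two statements shown above.

```perl
use 5.014;
use utf8;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper wrapping the NFD-and-strip trick shown above.
sub strip_marks {
    my ($s) = @_;
    $s = NFD($s);       # decompose where a canonical decomposition exists
    $s =~ s/\pM//g;     # drop the combining marks
    return $s;
}

# "Ñandú" loses its accents, but "Ł", "Ø" and "ß" have nothing to strip
for my $word ("Ñandú", "Łódź", "Øre", "Straße") {
    say "$word -> ", strip_marks($word);
}
```

A plain `/lodz/i` regex would therefore still fail against the stripped form of "Łódź", because the "Ł" is left as it is.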
The correct solution with UCA (thanks to tchrist):

    # check whether $look occurs in $str, ignoring accents and case
    use 5.014;
    use utf8;
    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";
    my $Collator = Unicode::Collate->new(
        normalization => undef, level => 1
    );
    my @match = $Collator->match($str, $look);
    say "match ok!" if @match;
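PS: "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, "Why does modern Perl avoid UTF-8 by default?"
What do you mean by "without extra modules"?
Here is a solution with use Unicode::Normalize;
See the perldoc.
I removed the "ţ" and "ļ" from your string; my Eclipse didn't want to save the script with them.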
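
    use strict;
    use warnings;
    use utf8;    # the pragma name is lowercase; save the source as UTF-8
    use Unicode::Normalize;

    my $str = "Îñtérñåtîöñålîžåtîöñ";
    for ( $str ) {  # the variable we work on
        ## convert to Unicode first
        ## if your data comes in Latin-1, then uncomment
        ## (and add "use Encode;" at the top):
        #$_ = Encode::decode( 'iso-8859-1', $_ );
        $_ = NFD( $_ );       ## decompose
        s/\pM//g;             ## strip combining characters
        s/[^\0-\x7F]//g;      ## clear everything else that is not ASCII
    }
    if ($str =~ /nation/) {
        print $str . "\n";
    }

The output I got was

    Internationaliation

The "ž" was dropped from the string, as if it were not treated as a combining character. "ž" is in fact a precomposed letter that NFD splits into "z" plus the combining caron U+030C, so on correctly decoded UTF-8 source only the caron is stripped and the script prints "Internationalization". A missing "ž" suggests the file was saved in another encoding.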
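The code in the for loop is taken from the page "How to remove diacritic marks from characters".
Another interesting read is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
Update:
As @tchrist pointed out, there is a better-suited algorithm, called UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.
The algorithm is described in Unicode Technical Standard #10, Unicode Collation Algorithm.
The Perl module is described on perldoc.perl.org.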
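As a rough illustration of what the `level` parameter controls, here is a minimal sketch using the `eq` method of Unicode::Collate with the default collation table. The helper name `uca_eq` is hypothetical.

```perl
use 5.014;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: compare two strings at a given collation level.
sub uca_eq {
    my ($level, $x, $y) = @_;
    return Unicode::Collate->new(level => $level)->eq($x, $y) ? 1 : 0;
}

# level 1: base letters only; level 2: accents count; level 3: case counts
say "level 1, nation vs Ñåţîöñ: ", uca_eq(1, "nation", "Ñåţîöñ");
say "level 2, nation vs Ñåţîöñ: ", uca_eq(2, "nation", "Ñåţîöñ");
say "level 2, nation vs NATION: ", uca_eq(2, "nation", "NATION");
say "level 3, nation vs NATION: ", uca_eq(3, "nation", "NATION");
```

This is why the solutions above use `level => 1`: accent and case differences only start to matter at levels 2 and 3.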