Cha*_*hak 6 unicode perl perl-module utf-8
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );
while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}
Output: Bad word is rdinator
Expected: Bad word is coördinator
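The module breaks words apart when $text contains Unicode characters. Any idea how to solve this?
I have Aspell 0.50.5 installed, and this module uses it. I think that might be the culprit.
Edit: Since Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell and Text::Hunspell. Then: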
$ hunspell -d en_US -l < badword.txt
coördinator
This shows the correct result, which means the problem is in either my code or Text::SpellChecker.
Following Miller's suggestion, I tried the following:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text = "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}
OUTPUT:
Flag is 1
Text is coördinator
Bad word is rdinator
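Does this mean the module cannot handle UTF-8 characters correctly?
This is a bug in Text::SpellChecker: the current version assumes that words contain only ASCII letters.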
http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm
#
# next_word
#
# Get the next misspelled word.
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:\'[a-zA-Z]+)?)/g) {

IMHO the best fix would be to use a word-splitting regex per language/locale, or to leave word splitting to the underlying library. aspell list reports co\xc3\xb6rdinator as a single word.
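To illustrate the first option, here is a minimal sketch that compares the ASCII-only pattern from next_word with a Unicode-aware variant built on \p{L} (letters) and \p{M} (combining marks). The helper split_words and the variable names are hypothetical and not part of Text::SpellChecker; this is not the module's actual fix, only a demonstration of why the word gets split and one way a tokenizer could avoid it. It assumes the text is already decoded to Perl characters (here via use utf8).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use Encode qw(encode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Text::SpellChecker): return every
# substring of $text that matches the given word pattern.
sub split_words {
    my ($text, $pattern) = @_;
    my @words;
    while ($text =~ /($pattern)/g) {
        push @words, $1;
    }
    return @words;
}

my $text = "The coördinator can't attend";

# The ASCII-only word pattern quoted from Text::SpellChecker 0.11.
my $ascii_word = qr/[a-zA-Z]+(?:'[a-zA-Z]+)?/;

# Assumed replacement: any Unicode letter or combining mark counts as
# part of a word, so both precomposed and decomposed "ö" are covered.
my $unicode_word = qr/[\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)?/;

my @ascii   = split_words($text, $ascii_word);
my @unicode = split_words($text, $unicode_word);

print "ASCII:   @ascii\n";      # The co rdinator can't attend
print "Unicode: @unicode\n";    # The coördinator can't attend

# The UTF-8 bytes of the whole word, escaped for display; this is the
# form in which aspell list reports it.
my $bytes = encode('UTF-8', $unicode[1]);
(my $escaped = $bytes) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
print "Bytes:   $escaped\n";    # co\xc3\xb6rdinator
```

With the ASCII-only pattern the word is cut at "ö" into "co" and "rdinator", which matches the "Bad word is rdinator" output in the question. The \p{L} variant keeps it whole, but it treats every script alike, so it is still only an approximation of per-language word splitting.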