Text::SpellChecker module and Unicode

Question


Cha*_*hak 6 unicode perl perl-module utf-8

#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}


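Output: Bad word is rdinator

Expected: Bad word is coördinator

The module mangles $text when it contains Unicode characters. Any idea how to solve this?

I have Aspell 0.50.5 installed, and this module is using it. I think that might be the culprit.

Edit: Since Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell and Text::Hunspell. Then: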

$ hunspell -d en_US -l < badword.txt
coördinator

shows the correct result. That means the problem is in either my code or Text::SpellChecker.

Taking Miller's suggestion into account, I did the following:

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}


OUTPUT:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module cannot handle UTF-8 characters correctly?

Answer 1

AnF*_*nFi 4

This is a bug in Text::SpellChecker: the current version assumes that words contain only ASCII letters.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
#
# Get the next misspelled word.
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:\'[a-zA-Z]+)?)/g) {

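IMHO, the best fix would be to use a word-splitting regex per language/locale, or to leave word splitting to the underlying library. aspell list reports co\xc3\xb6rdinator as a single word.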
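To make the difference concrete, here is a minimal sketch that runs the ASCII-only pattern quoted above next to a Unicode-aware alternative on the same text. The helper `split_words` and the `\p{L}`-based pattern are illustrative assumptions, not code from Text::SpellChecker, and `\p{L}` is only one possible choice for a language-independent word pattern.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains a non-ASCII character

# Hypothetical helper: return every match of a word pattern in $text.
sub split_words {
    my ($text, $pattern) = @_;
    my @words;
    while ($text =~ /($pattern)/g) {
        push @words, $1;
    }
    return @words;
}

my $text = "coördinator";

# Pattern quoted from Text::SpellChecker 0.11: ASCII letters only.
my $ascii_pattern = qr/[a-zA-Z]+(?:'[a-zA-Z]+)?/;

# Assumed alternative (not the module's code): any Unicode letter.
my $unicode_pattern = qr/\p{L}+(?:'\p{L}+)?/;

my @ascii_words   = split_words($text, $ascii_pattern);
my @unicode_words = split_words($text, $unicode_pattern);

binmode STDOUT, ":encoding(UTF-8)";
print "ASCII pattern:   @ascii_words\n";
print "Unicode pattern: @unicode_words\n";
```

The ASCII pattern splits the text at the ö into co and rdinator, while the Unicode-aware pattern keeps coördinator whole. The module reporting only rdinator is consistent with that split, presumably because the dictionary accepted co on its own.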

