Bar*_*ata 5 unicode perl character-encoding
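I have a file with one phrase/term per line, which I read into Perl from STDIN. I also have a list of stopwords (such as "á", "são", "é"), and I want to compare each word in the file against each stopword and remove it if they are equal. The problem is that I'm not sure what encoding the file uses.
This is what the file command reports: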
words.txt: Non-ISO extended-ASCII English text
My Linux terminal is UTF-8, and it displays some of the words correctly but not others. Here is the output for a few of them:
condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos
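You can see that lines 3 and 5 display the words with accents and special characters correctly, while the other lines do not. The correct output for the other lines would be: condiã, conteúdos and moçambique.
If I use binmode(STDOUT, utf8), the "wrong" lines are now output correctly, but the others are not. For example, line 3 becomes: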
ajuda, mas nÃ£o resolve
What should I do?
It works like this:
C:\Dev\Perl :: chcp
Aktive Codepage: 1252.

C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei KÃ¤se vier fÃ¼nf Wurst
eins zwei drei Käse vier fünf Wurst

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf

where mixed-encoding.pl is:
use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';

# Stopwords are character strings because of "use utf8" above.
my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        # On failure, keep the octets as they are: Perl treats them as Latin-1 characters.
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

Note that the script does not have to be encoded in UTF-8. It is just that if your script contains non-ASCII character data, you have to make sure the encodings match: add use utf8 if the source file is saved as UTF-8, and leave it out if it is not.
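To see why the "try UTF-8 first" fallback is a reasonable heuristic, here is a minimal sketch that isolates the decoding step. The helper name `decode_token` is hypothetical and not part of the script above; the sketch also assumes Windows-1252 as the legacy encoding and adds `LEAVE_SRC` so the input string is never modified. The octets are written as escapes so the example does not depend on how the source file is saved. Keep in mind that this is only a heuristic: a rare Latin-1 byte sequence that happens to be valid UTF-8 would be decoded the wrong way.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of the answer's script): decode one token of
# raw octets, trying strict UTF-8 first and falling back to Windows-1252.
sub decode_token {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; eval turns that into undef.
    my $chars = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $chars if defined $chars;
    # Not valid UTF-8: assume legacy single-byte text.
    return decode('cp1252', $octets);
}

my $utf8_octets   = "K\xC3\xA4se";   # "Käse" encoded as UTF-8 (5 octets)
my $latin1_octets = "K\xE4se";       # "Käse" encoded as Latin-1 (4 octets)

for my $octets ($utf8_octets, $latin1_octets) {
    my $token = decode_token($octets);
    printf "%d octets -> %d characters, stopword match: %s\n",
        length $octets, length $token,
        ( $token eq "K\x{E4}se" ? 'yes' : 'no' );
}
```

Both inputs end up as the same four-character string, which is what makes the `eq` comparison against the stopword list meaningful.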
Update based on tchrist's sound advice:
use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume 8-bit encoding.
        my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
        # Explicitly decode the legacy octets instead of relying on implicit Latin-1.
        $token = decode Windows1252 => $_, Encode::FB_CROAK if $@;
        # uc works on characters, so it only upcases accents correctly after decoding.
        push @tokens, uc $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}
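The runs above were made on a Windows-1252 console. For a UTF-8 terminal like the one in the question, the decoded tokens would also need to be encoded as UTF-8 on the way out. The following is a rough sketch of that variant using the question's Portuguese stopwords; the helper name `filter_line` and the sample lines are assumptions made for illustration, and Windows-1252 is again assumed as the legacy encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Stopwords written with escapes so the example does not depend on how this
# source file is saved: "á", "são", "é".
my %is_stopword = map { $_ => 1 } ( "\x{E1}", "s\x{E3}o", "\x{E9}" );

# Hypothetical helper: decode one mixed-encoding line token by token, drop
# stopwords, and return the remaining words as UTF-8 octets.
sub filter_line {
    my ($line) = @_;
    my @kept;
    for my $octets ( split ' ', $line ) {
        # Strict UTF-8 first; fall back to Windows-1252 if that fails.
        my $token = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
        $token = decode('cp1252', $octets) unless defined $token;
        push @kept, $token unless $is_stopword{$token};
    }
    return encode('UTF-8', join ' ', @kept);
}

# Sample lines modelled on the question: the first is Latin-1, the second UTF-8.
my @lines = (
    "conte\xFAdos s\xE3o",
    "pedag\xC3\xB3gico s\xC3\xA3o fen\xC3\xB3menos",
);
print filter_line($_), "\n" for @lines;
```

In a real filter the sample list would be replaced by a loop over STDIN, read without any decoding layer. Instead of calling `encode` by hand, setting `binmode(STDOUT, ':encoding(UTF-8)')` and printing the character strings should have the same effect.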