Pau*_*cas 5 c++ unicode diacritics icu
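Can someone provide sample code that strips diacritics from a UnicodeString using the ICU library in C++? That is, it should replace characters that carry accents, umlauts, and so on with their unaccented, un-umlauted equivalents, so that every accented é becomes a plain ASCII e. For example: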
UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}
Assume s has already been normalized. Thanks.
Que*_*det 16
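ICU lets you transliterate a string using a specific rule. The rule I use is NFD; [:M:] Remove; NFC: decompose, remove the diacritical marks, then recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string: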
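#include <memory>
#include <stdexcept>
#include <string>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // Create the transliterator: decompose, remove marks, recompose.
    // createInstance returns an object the caller owns, so hold it in a unique_ptr.
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> accentsConverter(
        icu::Transliterator::createInstance(
            "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !accentsConverter) {
        throw std::runtime_error(u_errorName(status));
    }

    // Transliterate the UTF-16 UnicodeString in place
    accentsConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);
    return result;
}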
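To make the middle step of that rule more concrete, here is a minimal sketch of what "remove the marks" amounts to once the text is decomposed. It does not use ICU at all: `removeCombiningMarks` is a hypothetical helper, it assumes the input is valid UTF-8 that is already in NFD, and it only covers the Combining Diacritical Marks block (U+0300 to U+036F), which is a small subset of what `[:M:]` matches. Treat it as an illustration, not a replacement for the transliterator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): removes code points in the
// Combining Diacritical Marks block, U+0300..U+036F, from a UTF-8 string.
// Assumes the input is valid UTF-8 and already decomposed (NFD).
std::string removeCombiningMarks(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);

        // Derive the sequence length from the lead byte.
        std::size_t len = 1;
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        if (i + len > utf8.size()) len = utf8.size() - i;  // truncated tail: copy as-is

        // U+0300..U+033F encode as 0xCC 0x80..0xBF,
        // U+0340..U+036F encode as 0xCD 0x80..0xAF.
        bool isMark = false;
        if (len == 2) {
            const unsigned char trail = static_cast<unsigned char>(utf8[i + 1]);
            isMark = (lead == 0xCC && trail >= 0x80 && trail <= 0xBF) ||
                     (lead == 0xCD && trail >= 0x80 && trail <= 0xAF);
        }

        if (!isMark) out.append(utf8, i, len);
        i += len;
    }
    return out;
}
```

With a decomposed é (e followed by U+0301, bytes `0xCC 0x81`), the mark is dropped and the base letter is kept. A letter such as ø (U+00F8), which is not a base letter plus a combining mark, passes through unchanged.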
Pau*_*cas -1
After more searching elsewhere:
UErrorCode status = U_ZERO_ERROR;
UnicodeString result;
// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
    // complain
}
// code to convert the normalized UTF-16 'result' to UTF-8 std::string 's8' elided
string buf8;
buf8.reserve( s8.length() );
// Keep only ASCII bytes. This drops the combining marks left by NFKD,
// along with every other non-ASCII character.
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
    char const c = *i;
    if ( static_cast<unsigned char>( c ) < 0x80 )
        buf8.push_back( c );
}
// result is in buf8
This is O(n) in the length of the string.
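For reference, the filtering loop can be wrapped in a small function, sketched below. `keepAscii` is a hypothetical name, and the sketch assumes its input is the UTF-8 form of an already decomposed (NFKD) string. The approach relies on the fact that every byte of a non-ASCII character in UTF-8 has its high bit set, so testing bytes one at a time removes whole characters.

```cpp
#include <string>

// Hypothetical helper: keeps only bytes below 0x80, i.e. ASCII characters.
// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so non-ASCII
// characters are removed in full, never cut in half.
std::string keepAscii(const std::string& s8) {
    std::string buf8;
    buf8.reserve(s8.length());
    for (char c : s8) {
        // Cast first: plain char may be signed, which would make high bytes negative.
        if (static_cast<unsigned char>(c) < 0x80)
            buf8.push_back(c);
    }
    return buf8;
}
```

One caveat worth keeping in mind: this removes anything that is not ASCII, not only the marks. A decomposed é (`"e\xCC\x81"`) becomes `"e"`, but ø (`"\xC3\xB8"`), which has no decomposition, disappears entirely.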