Iterating over code points with ICU

Question

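My goal is to iterate over a Unicode text string character by character, but the code below iterates over code units rather than code points, even though I am using next32PostInc(), which is supposed to iterate over code points: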

void iterate_codepoints(UCharCharacterIterator &it, std::string &str) {
    UChar32 c;
    while (it.hasNext()) {
        c = it.next32PostInc();
        str += c;
    }
}

void my_test() {
    const char testChars[] = "\xE6\x96\xAF"; // Chinese character 斯 (U+65AF) in UTF-8
    UnicodeString testString(testChars, "");
    const UChar *testText = testString.getTerminatedBuffer();

    UCharCharacterIterator iter(testText, u_strlen(testText));

    std::string str;
    iterate_codepoints(iter, str);
    std::cout << str; // outputs 斯 in UTF-8 format
}


int main() {
    my_test();
    return 0;
}

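The code above produces the correct output, which is the Chinese character, but three iterations occur for this single character instead of just one. Can someone explain what I am doing wrong?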

In short, I just want to loop over the characters, and I am happy to use any of the ICU iterator classes.

Still trying to solve this...

I have also observed some bad behavior when using UnicodeString, shown below. I am using VC++ 2013.

void test_02() {
    //  UnicodeString us = "abc 123 ñ";     // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1  
    //  UnicodeString us = "斯";             // results in bad  UTF-8: 3f
    //  UnicodeString us = "abc 123 ñ 斯";  // results in bad  UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f  (only the last part '3f' is corrupt)
    //  UnicodeString us = "\xE6\x96\xAF";  // results in bad  UTF-8: 00 55 24 04 c4 00 24  
    //  UnicodeString us = "\x61";          // results in good UTF-8: 61
    //  UnicodeString us = "\x61\x62\x63";  // results in good UTF-8: 61 62 63
    //  UnicodeString us = "\xC3\xB1";      // results in bad  UTF-8: c3 83 c2 b1  
    UnicodeString us = "ñ";                 // results in good UTF-8: c3 b1    
    std::string cs;
    us.toUTF8String(cs);
    std::cout << cs; // output result to file, i.e.: main >output.txt

}

Answer 1

Rem*_*eau 6

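Since your source data is UTF-8, you need to tell UnicodeString that. Its constructor has a codepage parameter for this purpose, but you are setting it to an empty string: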

UnicodeString testString(testChars, "");

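That tells UnicodeString to perform an invariant conversion, which is not what you want. You end up with 3 code points (U+00E6 U+0096 U+00AF) instead of 1 code point (U+65AF), which is why your loop iterates three times.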
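
To make the difference concrete, here is a minimal sketch in plain C++ (no ICU involved) that contrasts the two interpretations of the same three bytes. The helper names widen_bytes and decode_utf8 are hypothetical and are not ICU functions; the decoder assumes well-formed UTF-8 and does no validation, so it only illustrates the idea and is not how ICU converts text internally.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: treat every byte as its own code point.
// This mimics what you get when the bytes are not decoded as UTF-8.
std::vector<char32_t> widen_bytes(const std::string& bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Hypothetical helper: decode UTF-8 into code points.
// Assumes well-formed input; no error handling.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the bytes "\xE6\x96\xAF", widen_bytes yields three values (0xE6, 0x96, 0xAF), while decode_utf8 yields the single value 0x65AF.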

You need to change the constructor call to tell UnicodeString that the data is UTF-8, for example:

UnicodeString testString(testChars, "utf-8");
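
Once the string is decoded correctly, iterating by code point rather than by code unit only matters for characters outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair. The sketch below shows that combining step in plain C++, without ICU; the function name code_points is hypothetical, and unpaired surrogates are simply passed through unchanged, which is an assumption of this sketch rather than a statement about ICU's behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: walk UTF-16 code units and emit one value per code point,
// combining a lead surrogate with the trail surrogate that follows it.
std::vector<char32_t> code_points(const std::u16string& units) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t c = units[i];
        bool is_lead = c >= 0xD800 && c <= 0xDBFF;
        if (is_lead && i + 1 < units.size()) {
            char32_t trail = units[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                // Standard UTF-16 surrogate pair formula.
                c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
                ++i; // the trail unit has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}
```

U+65AF fits in a single UTF-16 code unit, so for the string in the question both kinds of iteration take one step once the UTF-8 input is decoded properly. In ICU itself, UnicodeString::fromUTF8() is another way to build the string from UTF-8 bytes without naming a codepage.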
