How do I measure the correct size of non-ASCII characters?

Question

msc*_*msc 6 c++ string size non-ascii-characters c++11

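In the program below, I am trying to measure the length of a string that contains non-ASCII characters.

However, I am not sure why size() does not print the correct length when the string contains non-ASCII characters.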

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox.

Answer 1

cbu*_*art 5

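std::string::size returns the length in bytes, not in characters. Your second string is stored in a Unicode encoding (UTF-8 here), so each character may take several bytes. Note that the same applies to std::wstring::size, because the result depends on the encoding: it returns the number of wide characters (code units), not the number of actual characters. With UTF-16 the two match for characters in the Basic Multilingual Plane, but characters outside it take two code units, and other encodings behave differently again (more on this in this answer).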
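To make the code-unit point concrete, here is a small sketch (not part of the original answer). The names `UnitCounts`, `hindi_sample` and `emoji_sample` are hypothetical, and the UTF-8 bytes are written as escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// size() counts code units of the string's element type, not characters.
struct UnitCounts {
    std::size_t utf8;   // number of char elements (bytes)
    std::size_t utf16;  // number of char16_t elements
    std::size_t utf32;  // number of char32_t elements
};

// The six-code-point word from the question in three encodings.
UnitCounts hindi_sample() {
    std::string    s8  = "\xe0\xa4\x87\xe0\xa4\x82\xe0\xa4\xa1\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xbe";
    std::u16string s16 = u"\u0907\u0902\u0921\u093f\u092f\u093e";
    std::u32string s32 = U"\u0907\u0902\u0921\u093f\u092f\u093e";
    return {s8.size(), s16.size(), s32.size()};
}

// A single code point outside the Basic Multilingual Plane (U+1F600).
UnitCounts emoji_sample() {
    std::string    s8  = "\xf0\x9f\x98\x80";
    std::u16string s16 = u"\U0001F600";  // stored as a surrogate pair
    std::u32string s32 = U"\U0001F600";
    return {s8.size(), s16.size(), s32.size()};
}
```

For the word from the question the three counts are 18, 6 and 6. For the single emoji they are 4, 2 and 1, so even a UTF-16 string's size() can overcount.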

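To measure the actual length (in number of symbols), you need to know the encoding so that you can separate, and therefore count, the characters correctly. See this answer for an example.

Another option for UTF-8 is to count the number of first (lead) bytes (credit to this other answer):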

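#include <string>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes s holds valid UTF-8.
int utf8_length(const std::string& s) {
  int len = 0;
  for (auto c : s)
      len += (c & 0xc0) != 0x80;
  return len;
}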
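
The lead-byte count above trusts its input. If the string might not be valid UTF-8, a variant that steps through each sequence can at least reject structurally broken data. This is only a sketch under stated assumptions: `utf8_length_checked` is a hypothetical name, and it checks lead and continuation byte patterns only, not overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of code points in s,
// or -1 if s is structurally malformed UTF-8.
long utf8_length_checked(const std::string& s) {
    long count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (lead < 0x80)                len = 1;  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return -1;                           // stray continuation or invalid lead byte
        if (i + len > s.size()) return -1;        // sequence cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return -1;    // expected a continuation byte
        }
        i += len;
        ++count;
    }
    return count;
}
```

On the 18-byte string from the question this returns 6. A truncated sequence such as "\xe0\xa4" returns -1.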

Note that the code point count can differ from the abstract character count. (2 upvotes)
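
A short illustration of that comment, as a sketch only: the same visible letter "é" can be stored as one code point or as two. `count_code_points`, `precomposed` and `decomposed` are hypothetical names introduced here, and the helper assumes valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes (10xxxxxx). Assumes valid UTF-8 input.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        n += (c & 0xC0) != 0x80;
    return n;
}

// "é" as the single precomposed code point U+00E9 (2 bytes).
const std::string precomposed = "\xc3\xa9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes).
const std::string decomposed = "e\xcc\x81";
```

`count_code_points(precomposed)` is 1 and `count_code_points(decomposed)` is 2, although both render as one character. Counting user-perceived characters needs grapheme cluster segmentation, which the C++ standard library does not provide.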

Answer 2

msc*_*msc 1

I used the std::wstring_convert class and got the correct string length.

#include <string>
#include <iostream>
#include <codecvt>
#include <locale> // std::wstring_convert is declared here

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "\xe0\xa4\x87\xe0\xa4\x82\xe0\xa4\xa1\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xbe"; // non-ASCII string
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn;
    auto sz = cn.from_bytes(s2).size();
    std::cout << "Size of " << s2 << " is " << sz << std::endl;
}

Live demo on Wandbox.

An important reference link with more information about std::wstring_convert is here.
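
One thing the conversion approach has to deal with is invalid input: from_bytes throws std::range_error when the conversion fails and no error string was supplied to the converter. A possible wrapper, as a sketch only (`code_points_via_convert` is a hypothetical name), is shown below. Keep in mind that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: returns the number of code points in a UTF-8
// string, or -1 if the converter reports an invalid sequence.
long code_points_via_convert(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    try {
        // Each char32_t in the result holds one code point.
        return static_cast<long>(conv.from_bytes(utf8).size());
    } catch (const std::range_error&) {
        return -1;  // conversion failed
    }
}
```

Called on s2 from the program above, this returns 6, the same value the answer prints.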

Archived:	8 years, 4 months ago
Views:	781
Last recorded:	8 years, 4 months ago