How do I convert a code point to UTF-8?

Question

dar*_*une 3 c++ boost utf-8 boost-locale c++17

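I have some code that reads a Unicode code point (escaped in a string as 0xF00).

Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach: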

unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);

?

Answer 1

Art*_*yer 5

You can use the standard library's std::wstring_convert to convert UTF-32 (code points) to UTF-8:

#include <codecvt>
#include <locale>
#include <string>

std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}

This returns a std::string of size 1, 2, 3, or 4, depending on how large codepoint is. It throws a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
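The size rule above can also be sketched by hand, with no library at all. This is a minimal illustration rather than a replacement for the approaches in this answer: `encode_utf8` is a hypothetical helper name, and rejecting surrogates (0xD800 to 0xDFFF) is an assumption of this sketch that may differ from what std::codecvt_utf8 does.

```cpp
#include <stdexcept>
#include <string>

// Hand-rolled sketch: encode one Unicode code point as UTF-8.
// encode_utf8 is a hypothetical helper, not a standard or Boost function.
std::string encode_utf8(char32_t cp) {
    // Assumption: surrogates are rejected along with out-of-range values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::range_error("not a Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, the code point 0xF00 from the question encodes to the three bytes E0 BC 80.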

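Your Boost version seems to do the same thing. The documentation says that utf_to_utf converts one UTF encoding to another, in this case UTF-32 to UTF-8. If you use char32_t, it is a "correct" approach that also works on systems where unsigned int is not the same size as char32_t.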

#include <string>
#include <boost/locale/encoding_utf.hpp>

// The function also converts the unsigned int to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}

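As a heads-up, `wstring_convert` and `codecvt_utf8` are deprecated as of C++17. The standard library offers no replacement, and the current recommendation is to use a dedicated library. (3 upvotes)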

Answer 2

Lig*_*ica 5

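As mentioned, a code point in this form is (conveniently) UTF-32, so what you're looking for is transcoding.

For a solution that doesn't rely on functions deprecated since C++17, isn't really ugly, and doesn't require a hefty third-party library, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.

It would look something like this: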

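#include <cstdint>
#include <iterator>
#include <vector>
#include "utf8.h"

const std::uint32_t codepoint{0xF00};
std::vector<unsigned char> result;

try
{
   utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
   // handle a value that is not a valid code point
}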

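(There's also utf8::unchecked::utf32to8, if you're allergic to exceptions.)

(And consider reading into a vector<char8_t> or std::u8string, since C++20.)

(Finally, note that I've specifically used uint32_t to ensure that the input has the proper width.)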
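To illustrate the point about width, here is a minimal sketch of getting from escaped text to a fixed-width value before transcoding. It assumes the escaped form is a plain hexadecimal string such as 0xF00, which the question does not spell out, and `parse_codepoint` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// parse_codepoint is a hypothetical helper, named here for illustration only.
// Assumption: the input is hexadecimal text such as "0xF00" or "F00".
std::uint32_t parse_codepoint(const std::string& text) {
    std::size_t used = 0;
    // Base 16 accepts an optional "0x" prefix. std::stoul throws
    // std::invalid_argument or std::out_of_range on unusable input.
    const unsigned long value = std::stoul(text, &used, 16);
    if (used != text.size()) {
        throw std::invalid_argument("trailing characters after code point");
    }
    if (value > 0x10FFFFul) {
        throw std::range_error("code point above 0x10FFFF");
    }
    // The range check above makes this narrowing conversion lossless.
    return static_cast<std::uint32_t>(value);
}
```

The returned value can then be passed to whichever transcoding function you choose, without depending on the size of unsigned int.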

I tend to use this library in projects until I need something a little heavier for other purposes (at which point I usually switch to ICU).
