How to split a string by emojis in C++

Question

I'm trying to take a string of emojis and split it into a vector holding each emoji separately.

Given the string:

std::string emojis = "";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"", "", "", "", "", "", "", ""};

Edit

I tried doing it like this:

std::string emojis = "";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a few seconds it fails with `terminate called after throwing an instance of 'std::bad_alloc'`.
When I try to check how many emojis are in the string using:

std::string emojis = "";
std::cout << emojis.size() << std::endl; // returns 32

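It returns a larger number, which I assume is due to the Unicode data. I don't know much about Unicode, but I'm trying to figure out how to tell where each emoji's data begins and ends, so that I can split the string into individual emojis.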

Answer 1

Bot*_*tje 3

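I would definitely recommend using a library with better Unicode support (all the large frameworks have one), but in a pinch you can get by knowing that the UTF-8 encoding spreads Unicode characters across multiple bytes, and that the first byte determines how many bytes a character consists of.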
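
This also explains why `emojis.size()` in the question reports 32 for eight emojis: `size()` counts bytes, and each of those emojis presumably takes four bytes. As a minimal sketch (the helper name `count_code_points` is made up for illustration, and the input is assumed to be valid UTF-8), code points can be counted by skipping continuation bytes, which always have the bit pattern `10xxxxxx`:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0u) != 0x80u) {
            ++n;  // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return n;
}
```

For a string made of eight 4-byte emojis, `size()` is 32 while `count_code_points` yields 8.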

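I stole a function from boost. The split_by_codepoint function walks an iterator over the input string, constructs a new string from the next N bytes (where N is determined by the byte-count function), and pushes it onto the ret vector.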

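#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(std::uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  std::uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

// Splits a UTF-8 string into one std::string per code point.
std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    auto count = static_cast<std::ptrdiff_t>(
        utf8_byte_count(static_cast<std::uint8_t>(*it)));
    // Clamp so a truncated trailing sequence cannot run past the end.
    count = std::min<std::ptrdiff_t>(count, input.cend() - it);
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}

int main() {
    // Plain literal: in C++20 a u8"" literal has type char8_t and can no
    // longer initialize std::string. Assumes a UTF-8 encoded source file.
    std::string emojis = "";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}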

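Note that this function only splits the string into UTF-8 strings containing one code point each. Determining whether a character is an emoji is left as an exercise: UTF-8-decode any 4-byte character and check whether it falls in the right range.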
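
One possible starting point for that exercise is sketched below. The names `decode_4byte` and `in_sample_emoji_block` are made up for illustration. The two ranges checked are the Emoticons block (U+1F600 to U+1F64F) and the Miscellaneous Symbols and Pictographs block (U+1F300 to U+1F5FF). They are only an assumption about what "the right range" could mean: they do not cover all emojis, and block membership is just an approximation of the Unicode Emoji property.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: decodes exactly one 4-byte UTF-8 sequence.
// Returns 0 if `s` is not a 4-byte sequence with a valid lead byte
// (11110xxx) followed by three continuation bytes (10xxxxxx).
std::uint32_t decode_4byte(const std::string& s) {
    if (s.size() != 4) return 0;
    const auto b0 = static_cast<unsigned char>(s[0]);
    const auto b1 = static_cast<unsigned char>(s[1]);
    const auto b2 = static_cast<unsigned char>(s[2]);
    const auto b3 = static_cast<unsigned char>(s[3]);
    if ((b0 & 0xF8u) != 0xF0u) return 0;
    if ((b1 & 0xC0u) != 0x80u || (b2 & 0xC0u) != 0x80u ||
        (b3 & 0xC0u) != 0x80u) {
        return 0;
    }
    // 3 payload bits from the lead byte, 6 from each continuation byte.
    return (static_cast<std::uint32_t>(b0 & 0x07u) << 18) |
           (static_cast<std::uint32_t>(b1 & 0x3Fu) << 12) |
           (static_cast<std::uint32_t>(b2 & 0x3Fu) << 6) |
           static_cast<std::uint32_t>(b3 & 0x3Fu);
}

// Deliberately incomplete, illustrative range check (see the note above).
bool in_sample_emoji_block(std::uint32_t cp) {
    const bool emoticons = cp >= 0x1F600u && cp <= 0x1F64Fu;
    const bool pictographs = cp >= 0x1F300u && cp <= 0x1F5FFu;
    return emoticons || pictographs;
}
```

Each 4-byte element returned by `split_by_codepoint` could be passed through `decode_4byte` and then `in_sample_emoji_block`. A production check would more likely rely on the emoji data tables published with the Unicode standard, or on a library that ships them.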

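Also note that 1 Unicode code point = 1 character does not hold, especially for emojis. There are grapheme clusters that span several Unicode code points. For example, `‍❤️‍` is 5 Unicode code points (a man's face, a heart and a woman's face, joined by ZWJs; 6 if the variation selector after the heart is counted), and some flags consist of `U+1F3F4` (waving flag), 2-5 CLDR characters denoting the country or region, and `U+E007F` (cancel tag).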
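
To illustrate the ZWJ case only, here is a rough sketch that keeps ZWJ-joined sequences together. It is not real grapheme segmentation (that is specified in Unicode Standard Annex #29 and implemented by libraries such as ICU). The function names are made up, the input is assumed to be valid UTF-8, and the only rules applied are: ZWJ (U+200D) and the variation selector U+FE0F attach to the previous cluster, and whatever follows a ZWJ attaches as well. Flag tag sequences and skin-tone modifiers are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the UTF-8 sequence that starts with lead byte c.
static std::size_t seq_len(unsigned char c) {
    if (c < 0x80u) return 1;
    if ((c & 0xE0u) == 0xC0u) return 2;
    if ((c & 0xF0u) == 0xE0u) return 3;
    if ((c & 0xF8u) == 0xF0u) return 4;
    return 1;  // stray continuation or invalid byte: treat as one unit
}

// Hypothetical, simplified grouping of code points into ZWJ clusters.
std::vector<std::string> split_zwj_clusters(const std::string& in) {
    const std::string zwj = "\xE2\x80\x8D";   // U+200D ZERO WIDTH JOINER
    const std::string vs16 = "\xEF\xB8\x8F";  // U+FE0F VARIATION SELECTOR-16
    std::vector<std::string> out;
    bool join_next = false;  // true right after a ZWJ
    for (std::size_t i = 0; i < in.size();) {
        std::size_t n = seq_len(static_cast<unsigned char>(in[i]));
        if (n > in.size() - i) n = in.size() - i;  // truncated tail
        const std::string cp = in.substr(i, n);
        const bool is_joiner = (cp == zwj);
        if (!out.empty() && (join_next || is_joiner || cp == vs16)) {
            out.back() += cp;  // extend the current cluster
        } else {
            out.push_back(cp);  // start a new cluster
        }
        join_next = is_joiner;
        i += n;
    }
    return out;
}
```

With this sketch, the six code points man, ZWJ, heart, variation selector, ZWJ, woman come back as a single 20-byte element, whereas `split_by_codepoint` would return six separate strings.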
