boost::property_tree::json_parser and two-byte wide characters

Question

Jar*_*łka 8 c++ unicode boost boost-propertytree

Introduction

std::string text = "á";

"á" is a two-byte character (assuming UTF-8 encoding),
so the following line prints 2.

std::cout << text.size() << "\n";

But std::cout still prints the text correctly.

std::cout << text << "\n";
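
To see why size() reports 2, it can help to dump the raw bytes of the string. The following is a minimal sketch, assuming the string holds UTF-8; hex_bytes and check_a_acute are hypothetical helpers written for illustration, not part of the standard library or Boost.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the bytes of s as space-separated uppercase hex.
std::string hex_bytes(const std::string& s) {
    static const char* const digits = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {          // view each char as an unsigned byte
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Usage sketch: U+00E1 (á) occupies two bytes in UTF-8.
void check_a_acute() {
    const std::string text = "\xC3\xA1"; // explicit bytes, independent of source encoding
    assert(text.size() == 2);            // std::string counts bytes, not characters
    assert(hex_bytes(text) == "C3 A1");
}
```

std::cout prints the text correctly because it simply forwards both bytes to the terminal, which then interprets them as UTF-8.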

My problem

I put text into a boost::property_tree::ptree and then pass it to write_json:

boost::property_tree::ptree root;
root.put<std::string>("text", text);

std::stringstream ss;
boost::property_tree::json_parser::write_json(ss, root);
std::cout << ss.str() << "\n";

The result is

{
    "text": "\u00C3\u00A1"
}

Once decoded, text equals "Ã¡", which is different from "á".
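
The output above is consistent with each byte of the UTF-8 sequence being escaped on its own, as if it were a separate code point. A minimal sketch of that behaviour follows; escape_each_byte and check_escape are hypothetical helpers written for illustration and are not Boost's implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: escapes every non-ASCII byte on its own as \u00XX,
// mirroring the behaviour seen in the write_json output above.
std::string escape_each_byte(const std::string& s) {
    static const char* const hexdigits = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {                  // plain ASCII passes through
            out += static_cast<char>(c);
            continue;
        }
        out += "\\u00";                  // the byte is treated as a whole code point
        out += hexdigits[c >> 4];
        out += hexdigits[c & 0x0F];
    }
    return out;
}

// Usage sketch: the two UTF-8 bytes of á become two separate escapes.
void check_escape() {
    assert(escape_each_byte("\xC3\xA1") == "\\u00C3\\u00A1");
}
```

A JSON reader decodes \u00C3\u00A1 as U+00C3 (Ã) followed by U+00A1 (¡), whereas á is the single code point U+00E1.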

Is it possible to fix this without switching to std::wstring? Would switching to a library other than boost::property_tree::ptree solve the problem?

Answer 1

Arp*_*ius 10

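I found a solution. In general, you need to provide a specialization of the boost::property_tree::json_parser::create_escapes template for [Ch = char] that implements your own "special-occasion, bug-free escaping".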

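The JSON standard defines "\uXXXX" escapes in terms of UTF-16 code units, but some libraries support UTF-8 encoding with "\xXX" escapes. If the JSON file may be encoded in UTF-8, you can pass all bytes above 0x7F through unescaped, which is what the original function was intended to do.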
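
For comparison, here is a minimal sketch of what escaping a whole code point in "\uXXXX" notation looks like, including the surrogate pair needed above U+FFFF. The names append_u_escape and json_u_escape are hypothetical and exist only for this illustration; they are not part of Boost.

```cpp
#include <cassert>
#include <string>

// Appends one UTF-16 code unit as a JSON \uXXXX escape.
void append_u_escape(std::string& out, unsigned unit) {
    static const char* const hexdigits = "0123456789ABCDEF";
    out += "\\u";
    for (int shift = 12; shift >= 0; shift -= 4)
        out += hexdigits[(unit >> shift) & 0xF];
}

// Hypothetical helper: escapes one Unicode code point using \u notation,
// emitting a surrogate pair for code points above U+FFFF.
std::string json_u_escape(char32_t cp) {
    // Only Unicode scalar values are accepted (no lone surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x10000) {
        append_u_escape(out, static_cast<unsigned>(cp));
    } else {
        const char32_t v = cp - 0x10000;
        append_u_escape(out, static_cast<unsigned>(0xD800 + (v >> 10)));   // high surrogate
        append_u_escape(out, static_cast<unsigned>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

Under this scheme á (U+00E1) would be written as the single escape \u00E1, not as one escape per UTF-8 byte.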

I put this code before the place where boost::property_tree::json_parser::write_json is used. It is adapted from boost_1_49_0/boost/property_tree/detail/json_parser_write.hpp:

namespace boost { namespace property_tree { namespace json_parser
{
    // Create necessary escape sequences from illegal characters
    template<>
    std::basic_string<char> create_escapes(const std::basic_string<char> &s)
    {
        std::basic_string<char> result;
        std::basic_string<char>::const_iterator b = s.begin();
        std::basic_string<char>::const_iterator e = s.end();
        while (b != e)
        {
            // This assumes an ASCII superset. But so does everything in PTree.
            // We escape everything outside ASCII, because this code can't
            // handle high unicode characters.
            if (*b == 0x20 || *b == 0x21 || (*b >= 0x23 && *b <= 0x2E) ||
                (*b >= 0x30 && *b <= 0x5B) ||
                (*b >= 0x5D && *b <= 0xFF) || // never matches bytes >= 0x80 when char is signed
                (*b >= -0x80 && *b < 0))      // added: passes UTF-8 bytes held in negative signed chars
                result += *b;
            else if (*b == char('\b')) result += char('\\'), result += char('b');
            else if (*b == char('\f')) result += char('\\'), result += char('f');
            else if (*b == char('\n')) result += char('\\'), result += char('n');
            else if (*b == char('\r')) result += char('\\'), result += char('r');
            else if (*b == char('/')) result += char('\\'), result += char('/');
            else if (*b == char('"'))  result += char('\\'), result += char('"');
            else if (*b == char('\\')) result += char('\\'), result += char('\\');
            else
            {
                const char *hexdigits = "0123456789ABCDEF";
                typedef make_unsigned<char>::type UCh;
                unsigned long u = (std::min)(static_cast<unsigned long>(
                                                 static_cast<UCh>(*b)),
                                             0xFFFFul);
                int d1 = u / 4096; u -= d1 * 4096;
                int d2 = u / 256; u -= d2 * 256;
                int d3 = u / 16; u -= d3 * 16;
                int d4 = u;
                result += char('\\'); result += char('u');
                result += char(hexdigits[d1]); result += char(hexdigits[d2]);
                result += char(hexdigits[d3]); result += char(hexdigits[d4]);
            }
            ++b;
        }
        return result;
    }
} } }

The output I get:

{
    "text": "aáb"
}
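
The comment in the specialization above points at the root cause: where plain char is signed, a byte such as 0xC3 is stored as a negative value, so a range check up to 0xFF never matches it. The sketch below isolates that comparison and shows a portable alternative; all function names here are hypothetical, and whether char is signed is implementation-defined.

```cpp
#include <cassert>
#include <limits>

// The upper range check from the original code, isolated for illustration.
// The char is promoted to int, so a negative char stays negative.
bool in_original_high_range(char c) {
    const int v = c;
    return v >= 0x5D && v <= 0xFF;
}

// Portable alternative: view the char as an unsigned byte before comparing.
bool is_non_ascii_byte(char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}

// Whether plain char is signed on this platform (implementation-defined).
constexpr bool char_is_signed = std::numeric_limits<char>::is_signed;

// Usage sketch: the original check only accepts the byte if char is unsigned.
void check_signedness_effect() {
    const char byte = static_cast<char>(0xC3);   // first UTF-8 byte of á
    assert(in_original_high_range(byte) == !char_is_signed);
    assert(is_non_ascii_byte(byte));             // holds either way
}
```

Casting to unsigned char would be one way to avoid the extra negative-range clause, at the cost of diverging a little further from the Boost source.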

The function boost::property_tree::json_parser::a_unicode has a similar problem: it reads escaped Unicode characters into signed chars.

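Every byte in the UTF-8 encoding of a Unicode character above 0x7F is itself above 0x7F (below 0 for a signed char), so this function passes UTF-8 through correctly. Of course, some Unicode characters may not be printable, and some UTF-8 byte sequences must never occur. (2 upvotes)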
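
The property stated in this comment can be checked with a small encoder. This is a minimal sketch under the assumption that the input is a valid Unicode scalar value; encode_utf8 and all_bytes_non_ascii are hypothetical names, not Boost functions.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode scalar value as UTF-8.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True when every byte of s has its high bit set (value >= 0x80).
bool all_bytes_non_ascii(const std::string& s) {
    for (unsigned char c : s)
        if (c < 0x80) return false;
    return true;
}
```

Because lead bytes start with the bits 11 and continuation bytes with 10, every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for an ASCII character that needs escaping.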
