Why do characters turn into garbage? libcurl, C++, UTF-8 encoded HTML

Question


uoa*_*nci 5 c++ string utf-8 libcurl codepages

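First of all, sorry for my bad English.
I have done my research but found no answer that solves my problem.
I have read up on code pages, UTF-8, and the related topics in C and C++, and I also know that a std::string can hold UTF-8.
My development machine runs English Windows XP with the console code page set to 1254 (Windows Turkish). I can put the Turkish extended characters (İığşçüö) in a std::string, count them, and send them to the mysqlpp API to write to the database without any problem. My problem starts when I fetch some HTML with curl and write it to a std::string.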


    #include <iostream>
    #include <windows.h>
    #include <wincon.h>
    #include <curl.h>
    #include <string>
    int main()
    {
       SetConsoleCP(1254);
       SetConsoleOutputCP(1254);
       std::string s;
       std::cin>>s;
       std::cout<<s<<std::endl;
       return 0;
    }


When I run this and type ğşçöüİı, the output is the same: ğşçöüİı.


    #include <iostream>
    #include <windows.h>
    #include <wincon.h>
    #include <curl.h>   // usually <curl/curl.h>, depending on the include path
    #include <string>

    // libcurl write callback: append the received bytes to the string unchanged.
    size_t writer(char *data, size_t size, size_t nmemb, std::string *buffer)
    {
       size_t res = 0;   // 0 signals a failed write to libcurl when buffer is NULL
       if(buffer!=NULL)
       {
          buffer->append(data,size*nmemb);
          res=size*nmemb;
       }
       return res;
    }
    int main()
    {
       SetConsoleOutputCP(1254);
       std::string html;
       CURL *curl;
       CURLcode result;
       curl=curl_easy_init();
       if(curl)
       {
          curl_easy_setopt(curl, CURLOPT_URL, "http://site.com");
          curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);
          curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
          result=curl_easy_perform(curl);
          if(result==CURLE_OK)
          {
             std::cout<<html<<std::endl;
          }
          curl_easy_cleanup(curl);   // release the easy handle
       }
       return 0;
    }


When I compile and run this:


If the HTML contains 'ı', cmd prints 'Ä±'; 'ö' is printed as 'Ä¶', 'ğ' as 'ÄŸ', 'İ' as 'Ä˚', and so on.
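The pattern above is what appears when each byte of a UTF-8 sequence is displayed as a separate single-byte character. The sketch below reproduces it for 'ı' (bytes C4 B1). The function name `misread_as_single_byte` is made up for illustration, and Latin-1 is used as a stand-in for Windows-1254: the two agree on bytes 0xA0..0xFF except at six positions (0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE) and differ in 0x80..0x9F, so this is an approximation, not an exact model of the console.

```cpp
#include <string>

// Re-encode every input byte as if it were one Latin-1 character and return
// the result as UTF-8, i.e. roughly what a single-byte console would show.
// Assumption: Latin-1 approximates Windows-1254 for the bytes used here.
std::string misread_as_single_byte(const std::string& utf8_bytes)
{
    std::string shown;
    for (unsigned char byte : utf8_bytes) {
        if (byte < 0x80) {
            shown.push_back(static_cast<char>(byte));                  // ASCII is unchanged
        } else {
            shown.push_back(static_cast<char>(0xC0 | (byte >> 6)));    // UTF-8 lead byte
            shown.push_back(static_cast<char>(0x80 | (byte & 0x3F)));  // continuation byte
        }
    }
    return shown;
}
```

Passing the two bytes of 'ı' ("\xC4\xB1") yields the four bytes C3 84 C2 B1, which is 'Ä±' in UTF-8. This suggests the std::string holds correct UTF-8 and only the display step interprets it with the wrong code page.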


If I change the code page to 65000,


    ...
    SetConsoleOutputCP(65000);//For utf8
    ...


the result is the same, so the cause of the problem is not the cmd code page.


The HTTP response headers say the charset is utf-8, and the HTML meta tag says the same.


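As far as I understand, the source of the problem is the "writer" function or curl itself. The incoming data is parsed into chars, so extended characters such as ı, İ, ğ are split into 2 chars and stored that way in the std::string's char array. As a result, the code page equivalents of these half-characters are what gets printed, or used anywhere else in the code (for example when mysqlpp writes that string to the database).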
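To check the hypothesis about the write callback, here is a minimal sketch that can be run without libcurl. `append_chunk` has the same shape as a libcurl write callback, and `count_code_points` is a hypothetical helper added only for this illustration. It shows that appending raw bytes keeps a two-byte UTF-8 sequence intact even when the sequence is split across two chunks; a std::string storing 'ı' as two chars is expected, not a corruption.

```cpp
#include <cstddef>
#include <string>

// Same parameter list as a libcurl write callback; libcurl is not needed
// to see the behavior. Appends size*nmemb raw bytes to the target string.
std::size_t append_chunk(char* data, std::size_t size, std::size_t nmemb, void* userdata)
{
    std::string* buffer = static_cast<std::string*>(userdata);
    if (buffer == nullptr) {
        return 0;  // a short count tells libcurl the write failed
    }
    buffer->append(data, size * nmemb);
    return size * nmemb;
}

// Count UTF-8 code points: every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Feeding the chunks {'a', 0xC4} and {0xB1, 'b'} produces the four bytes of "aıb", which `count_code_points` reports as three characters. If the bytes in the string are correct, the remaining suspects are the steps that interpret them: the console code page and font for printing, and the connection character set for the database.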


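I don't know how to solve this, or what to change in the writer function or anywhere else.
Is my reasoning correct? If so, what should I do about it? Or is the source of the problem somewhere else?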


I use mingw32 on 32-bit Windows XP with the Code::Blocks IDE.


Answer 1

sth*_*sth 1

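The correct code page for UTF-8 is 65001, not 65000.

Also, did you check whether setting the code page succeeded? The SetConsoleOutputCP function indicates success or failure through its return value.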
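A minimal sketch of that check follows. The helper name `use_utf8_console_output` and the constant `kUtf8CodePage` are assumptions made for this example; the Windows calls are `SetConsoleOutputCP`, `GetConsoleOutputCP`, and `GetLastError`. The `#ifdef` guard only lets the snippet compile on other platforms, where it does nothing.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// 65001 is the Windows code page identifier for UTF-8 (CP_UTF8 in <windows.h>).
const unsigned int kUtf8CodePage = 65001;

// Hypothetical helper: switch console output to UTF-8 and report whether
// the switch actually took effect. On non-Windows builds it is a no-op.
bool use_utf8_console_output()
{
#ifdef _WIN32
    if (!SetConsoleOutputCP(kUtf8CodePage)) {
        // A zero return value means the call failed.
        std::fprintf(stderr, "SetConsoleOutputCP failed, error %lu\n",
                     static_cast<unsigned long>(GetLastError()));
        return false;
    }
    return GetConsoleOutputCP() == kUtf8CodePage;
#else
    return true;
#endif
}
```

Calling this at the start of main() and printing a warning when it returns false would show whether the code page switch is being silently rejected. Even with code page 65001 active, the console font may also need to contain the Turkish glyphs for them to display.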

Archived:	14 years, 3 months ago
Views:	1781
Last activity:	14 years, 3 months ago