Joh*_*han 6 c++ qt utf-8 latin1 qbytearray
我想将QString转换为utf8或latin1 QByteArray,但今天我将所有内容都视为utf8.
而我正在使用高于0x7f的latin1的较高段中的一些字符来测试它,其中德语ü是一个很好的例子.
如果我喜欢这样:
QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();
QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();
QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();
Run Code Online (Sandbox Code Playgroud)
我得到以下输出.
utf8 "ü" "c3bc"
Latin1 "ü" "c3bc"
ISO 8859-1 "ü" "c3bc"
Run Code Online (Sandbox Code Playgroud)
正如你所看到的,我到处都得到了unicode 0xc3bc,我希望在第2步和第3步得到Latin1 0xfc.
我的猜测是我应该得到这样的东西:
utf8 "ü" "c3bc"
Latin1 "ü" "fc"
ISO 8859-1 "ü" "fc"
Run Code Online (Sandbox Code Playgroud)
这里发生了什么?
/谢谢
链接到一些字符表:
此代码是在基于Ubuntu 10.04的系统上构建和执行的.
$> uname -a
Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
$> env | grep LANG
LANG=en_US.utf8
Run Code Online (Sandbox Code Playgroud)
如果我尝试使用
utf8.append(name.toUtf8());
Run Code Online (Sandbox Code Playgroud)
我得到了这个输出
utf8 "ü" "c383c2bc"
Latin1 "ü" "c3bc"
ISO 8859-1 "ü" "c3bc"
Run Code Online (Sandbox Code Playgroud)
所以latin1是unicode,utf8是双重编码的......
这必须取决于一些系统设置?
如果我运行它(无法获取.name()来构建)
qDebug() << "system name:" << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();
Run Code Online (Sandbox Code Playgroud)
然后我明白了:
system name: "en_US"
codecForCStrings: 0x0
codecForLocale: "System"
Run Code Online (Sandbox Code Playgroud)
解
如果我指定它是UTF-8我正在使用所以不同的类知道这个,那么它的工作原理.
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
qDebug() << "system name:" << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();
QString name("\u00fc");
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();
QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();
QTextCodec *codec = QTextCodec::codecForName("latin1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();
Run Code Online (Sandbox Code Playgroud)
然后我得到这个输出:
system name: "en_US"
codecForCStrings: "UTF-8"
codecForLocale: "UTF-8"
utf8 "ü" "c3bc"
Latin1 "ü" "fc"
ISO 8859-1 "ü" "fc"
Run Code Online (Sandbox Code Playgroud)
看起来应该是这样的.
要知道的事情:
C++标准中有一些称为执行字符集的东西,它描述了字符串和字符文字在编译器生成的二进制文件中的输出.您可以在http://gcc.gnu.org网站上的C预处理器手册的1概述部分的1.1字符集子部分中阅读它.
问题:字符串文字
会产生什么"\u00fc"?
答:
这取决于执行字符集是什么.对于gcc(这是你正在使用的),默认情况下它是UTF-8,除非你指定了与-fexec-charset选项不同的东西.您可以在http://gcc.gnu.org网站上的GCC手册的3.11选项控制预处理器子部分的3个GCC命令选项部分中了解控制预处理阶段的这个和其他选项.现在,当我们知道执行字符集是UTF-8时,我们知道将转换为Unicode的代码点的UTF-8编码,这是两个字节的序列."\u00fc"U+00FC0xc3 0xbc
QString::QString ( const char * str )并QByteArray & QByteArray::append ( const QString & str )依赖于全球状态QString的构造函数接受使用编解码器设置的char *调用(如果已设置任何编解码器)或执行相同的操作(如果没有设置编解码器).QString QString::fromAscii ( const char * str, int size = -1 )void QTextCodec::setCodecForCStrings ( QTextCodec * codec )QString QString::fromLatin1 ( const char * str, int size = -1 )
问题:
QString的构造函数将使用什么编解码器来解码0xc3 0xbc它得到的两个字节序列()?
答:
默认情况下,没有设置编解码器,QTextCodec::setCodecForCStrings()这就是为什么Latin1将用于解码字节序列的原因.由于0xc3和0xbc在拉丁语1都有效,表示分别为A和¼(这应该已经很熟悉了,因为它是直接取自这个回答你刚才的问题),我们得到与这两个字符即QString.
qDebug() 不是8位清洁您不应该使用QDebugclass来输出ASCII之外的任何内容.你不能保证得到什么.
测试程序:
#include <QtCore>
void dbg(char const * rawInput, QString s) {
QString codepoints;
foreach(QChar chr, s) {
codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
}
qDebug() << "Input: " << rawInput
<< ", "
<< "Unicode codepoints: " << codepoints;
}
int main(int argc, char *argv[])
{
QCoreApplication app(argc, argv);
qDebug() << "system name:"
<< QLocale::system().name();
for (int i = 1; i <= 5; ++i) {
switch(i) {
case 1:
qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
break;
case 2:
qDebug() << "\nWith codecForCStrings set to UTF-8\n";
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
break;
case 3:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
break;
case 4:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
break;
}
qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
? QTextCodec::codecForCStrings()->name()
: "NOT SET");
qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
? QTextCodec::codecForLocale()->name()
: "NOT SET");
qDebug() << "\n1. Using QString::QString(char const *)";
dbg("\\u00fc", QString("\u00fc"));
dbg("\\xc3\\xbc", QString("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));
qDebug() << "\n2. Using QString::fromUtf8(char const *)";
dbg("\\u00fc", QString::fromUtf8("\u00fc"));
dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));
qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
}
return app.exec();
}
Run Code Online (Sandbox Code Playgroud)
在Windows XP上的mingw 4.4.0输出:
system name: "pl_PL"
Without codecForCStrings (default is Latin1)
codecForCStrings: "NOT SET"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
With codecForCStrings set to UTF-8
codecForCStrings: "UTF-8"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8
codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
Run Code Online (Sandbox Code Playgroud)
我要感谢来自#qt freenode.org IRC频道的thiago,cbreak,peppe和heinz,以展示和帮助我理解这里涉及的问题.
| 归档时间: |
|
| 查看次数: |
19746 次 |
| 最近记录: |