lkj*_*dsa 0 c++ linux unicode raspberry-pi
I am working on a Raspberry Pi and trying to print a Unicode character using the following:
test.cpp:
#include<iostream>
using namespace std;
int main() {
char a=L'\u1234';
cout << a << endl;
return 0;
}
When I compile it with g++, I get this warning:
test.cpp: In function "int main()":
test.cpp:4:9: warning: large integer implicitly truncated to unsigned type [-Woverflow]
The output is:
4
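Also, this is not in a GUI, and, in case it is relevant, my distribution is Raspbian Wheezy.
As a note on one of the earlier answers: you should not use wchar_t and the w* functions on Linux. The POSIX API uses the char data type, and most POSIX implementations use UTF-8 as the default encoding. Quoting the C++ standard (ISO/IEC 14882:2011):
5.3.3 Sizeof
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1. The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. [ Note: in particular, sizeof(bool), sizeof(char16_t), sizeof(char32_t), and sizeof(wchar_t) are implementation-defined. - end note ]
UTF-8 uses 1-byte code units and up to 4 code units to represent one code point, so char is enough to store UTF-8 strings. To manipulate them, however, you need to work out whether a particular code unit belongs to a multi-byte sequence and build your processing logic with that in mind. wchar_t has an implementation-defined size, and on the Linux distributions I have seen it is 4 bytes.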
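To illustrate what "working out whether a code unit belongs to a multi-byte sequence" can look like, here is a minimal sketch. The helper names is_continuation_byte and count_code_points are hypothetical (they are not from the standard library or the original answer), and the code assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: in UTF-8, every byte of the form 10xxxxxx is a
// continuation byte, i.e. it is never the first byte of a code point.
bool is_continuation_byte(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts code points by counting the bytes that start a sequence.
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if (!is_continuation_byte(byte)) {
            ++count;
        }
    }
    return count;
}
```

For a string holding only the bytes E1 88 B4, size() reports 3 code units while count_code_points reports 1 code point, which is exactly the distinction that byte-oriented char processing has to keep in mind.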
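There is also the issue that the mapping from source code to object code may transform your encoding in a compiler-specific way:
2.2 Phases of translation
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
In any case, in most situations no conversion is applied to your source code, so the strings you put into a char* remain unchanged. If you encode your source code in UTF-8, then your char*s will contain bytes representing UTF-8 code units.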
As for your code example: it does not work as expected because 1 char has a size of 1 byte. Unicode code points may require several (up to 4) UTF-8 code units to be serialized (for UTF-8 1 code unit == 1 byte). You can see here that U+1234 requires three bytes E1 88 B4 when UTF-8 is used and, therefore, cannot be stored in a single char. If you modify your code as follows it's going to work just fine:
#include <iostream>
int main() {
// string literals are const, so the pointer must be const char*
const char* str = "\u1234";
std::cout << str << std::endl;
return 0;
}
This is going to output ሴ, though you may see nothing depending on your console and the installed fonts; the actual bytes are going to be there. Note that with double quotes you also have a \0 terminator in memory.
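The three bytes E1 88 B4 quoted above can also be derived by hand. The following is a rough sketch of the standard UTF-8 bit layout; encode_utf8 is a hypothetical helper name, and the function assumes it is given a valid code point (at most 0x10FFFF and not a surrogate), which it does not check.

```cpp
#include <string>

// Hypothetical helper: encodes one code point as a UTF-8 byte sequence.
// Assumption: cp is a valid Unicode scalar value; no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0x1234) takes the three-byte branch, which is why a single char cannot hold the result.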
You could also use an array, but not with single quotes since you would need a different data type (see here for more information):
#include <iostream>
int main() {
// string literals are const, so the pointer must be const char*
const char* str = "\u1234";
std::cout << str << std::endl;
// size of the array is 4 because \0 is appended
// for string literals and there are 3 bytes
// needed to represent the code point
char arr[4] = "\u1234";
// write exactly the 3 UTF-8 bytes, without the terminator
std::cout.write(arr, 3);
std::cout << std::endl;
return 0;
}
The output is going to be ሴ on two separate lines in this case.