How do I convert control characters in MySQL from latin1 to UTF-8?

Der*_*ney 7 mysql character-set

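While converting a database to UTF-8, I noticed some strange behavior with the control characters 0x80-0x9F. For example, 0x92 (right apostrophe) is not converted to UTF-8, and the rest of the column's content is truncated, when I use the following: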

CREATE TABLE `bar` (
 `content` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1

INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.06 sec)

SELECT content FROM bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ                                                 |
+---------------------------------------------------------------------------------+
1 row in set (0.06 sec)

ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected, 1 warning (0.06 sec)
Records: 1  Duplicates: 0  Warnings: 1

SHOW WARNINGS;
+---------+------+-------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                             |
+---------+------+-------------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x80\x81\x82\x83\x84\x85...' for column 'content' at row 1 |
+---------+------+-------------------------------------------------------------------------------------+
1 row in set (0.06 sec)

SELECT * FROM bar;
+---------+
| content |
+---------+
|         |
+---------+
1 row in set (0.06 sec)

Although 0x80-0x9F are normally not allowed in Latin1, MySQL seems to handle this range differently:

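MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as "undefined", whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. [source]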

Yet MySQL seems unable to convert values in that range from its latin1 character set to its UTF-8 character set.
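To make the distinction in the quoted passage concrete, here is a small Python sketch (Python is used only for illustration and is not part of the original question). It assumes that Python's `cp1252` codec is a reasonable stand-in for MySQL's latin1 on the assigned code points, and that Python's `latin-1` codec stands in for strict ISO 8859-1. How MySQL itself treats the unassigned bytes is outside the scope of this sketch.

```python
# Byte 0x92 is the right single quotation mark in Windows cp1252.
raw = b"\x92"

# Read as cp1252 (the assumption here for MySQL's latin1), it is U+2019.
as_cp1252 = raw.decode("cp1252")
utf8_bytes = as_cp1252.encode("utf-8")

# Read as strict ISO 8859-1, the same byte is the C1 control U+0092.
as_iso = raw.decode("latin-1")

print(hex(ord(as_cp1252)))  # 0x2019
print(utf8_bytes.hex())     # e28099
print(hex(ord(as_iso)))     # 0x92

# Collect the bytes in 0x80-0x9F that Python's cp1252 codec refuses to decode.
unassigned = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(b)

print([hex(b) for b in unassigned])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

So the expected UTF-8 form of 0x92 is the three bytes E2 80 99, and five of the bytes in the INSERT above have no cp1252 character at all.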

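These characters got into my database through copy/paste from Word documents (cp1252). I may have found a way to make the application force proper UTF-8 values for new entries, but I still need to make sure the old ones are converted correctly.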

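Is there a way in MySQL to convert them to their UTF-8 equivalents without going through every row of every text column and replacing them with ASCII-friendly versions?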

atx*_*dba 4

I'm not sure. I tried to reproduce your problem, but the ALTER worked fine for me.

test > CREATE TABLE `bar` (  `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;  INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 0 rows affected (0.02 sec)

Query OK, 1 row affected (0.00 sec)

test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

test > select * from bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ������������������������������� |
+---------------------------------+
1 row in set (0.00 sec)

test > set names utf8;
Query OK, 0 rows affected (0.00 sec)

test > select * from bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
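One possible reading of the two SELECT outputs, sketched in Python and assuming the terminal expects UTF-8: with a latin1 connection the server returns single cp1252 bytes, which are not valid UTF-8 on their own and therefore show up as replacement characters. After `set names utf8` it returns real UTF-8 sequences. The sample string and the variable names are illustrative only.

```python
# A few of the characters stored in the column after the ALTER.
stored = "\u20ac\u2019\u201c\u201d"  # euro sign, right quote, left/right double quotes

# Connection charset latin1: the server would return cp1252 bytes.
sent_latin1 = stored.encode("cp1252")
# A UTF-8 terminal cannot decode these lone bytes, so each becomes U+FFFD.
shown_latin1 = sent_latin1.decode("utf-8", errors="replace")

# Connection charset utf8: the server would return UTF-8 bytes.
sent_utf8 = stored.encode("utf-8")
shown_utf8 = sent_utf8.decode("utf-8")

print(sent_latin1.hex())                      # 80929394
print(shown_latin1 == "\ufffd" * len(stored)) # True
print(shown_utf8 == stored)                   # True
```

Under that assumption the stored data is the same in both cases; only the bytes sent to the client differ.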

Here are my relevant character set settings:

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Edit

My character set settings before running set names utf8:

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
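For readers comparing the two listings, this short Python sketch diffs them. The values are copied from the two SHOW VARIABLES outputs in this answer; the dictionary names `before` and `after` are just labels chosen for the example.

```python
# Session settings before running "set names utf8" (second listing).
before = {
    "character_set_client": "latin1",
    "character_set_connection": "latin1",
    "character_set_database": "latin1",
    "character_set_filesystem": "binary",
    "character_set_results": "latin1",
    "character_set_server": "latin1",
    "character_set_system": "utf8",
}

# Session settings after running "set names utf8" (first listing).
after = dict(
    before,
    character_set_client="utf8",
    character_set_connection="utf8",
    character_set_results="utf8",
)

# Names of the variables whose values differ between the two listings.
changed = sorted(name for name in before if before[name] != after[name])
print(changed)
# ['character_set_client', 'character_set_connection', 'character_set_results']
```

Only the three connection-related variables differ; the database and server character sets stay latin1 in both listings.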

Version

test > select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)