Why is there no UTF-24?

Question

Why is there no UTF-24?

Ant*_*ull 23 unicode character-encoding utf-32

Possible duplicate:
Why does UTF-32 exist when only 21 bits are needed for every character?

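The maximum Unicode code point in UTF-32 is 0x10FFFF. UTF-32 therefore has 21 information bits and 11 superfluous blank bits. So why is there no UTF-24 encoding (i.e. UTF-32 with the high byte removed) that stores each code point in 3 bytes rather than 4?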
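To make the idea concrete, here is a minimal sketch of what such an encoding could look like, assuming "UTF-24" simply means big-endian UTF-32 with the always-zero high byte dropped. No such standard exists; the names `utf24_encode` and `utf24_decode` are hypothetical and only for illustration.

```python
# Hypothetical "UTF-24": UTF-32BE with the always-zero high byte removed.
# This is NOT a standardized encoding; the function names are made up.

MAX_CODE_POINT = 0x10FFFF


def _is_surrogate(cp: int) -> bool:
    # Surrogate code points are not Unicode scalar values.
    return 0xD800 <= cp <= 0xDFFF


def utf24_encode(text: str) -> bytes:
    """Store every code point of `text` in exactly 3 big-endian bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if _is_surrogate(cp):
            raise ValueError(f"cannot encode surrogate U+{cp:04X}")
        out += cp.to_bytes(3, "big")
    return bytes(out)


def utf24_decode(data: bytes) -> str:
    """Inverse of utf24_encode; rejects values outside the Unicode range."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], "big")
        if cp > MAX_CODE_POINT or _is_surrogate(cp):
            raise ValueError(f"invalid code unit 0x{cp:06X}")
        chars.append(chr(cp))
    return "".join(chars)
```

For example, `utf24_encode("A")` gives `b"\x00\x00A"`, and any valid string survives a round trip through `utf24_decode`. Every code unit is 24 bits wide, but three of those bits are still unused, since 21 bits are enough.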

Answer 1

Ski*_*tol 22

Well, the truth is that UTF-24 was suggested in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned there were:

"UTF-24 
Advantages: 
 1. Fixed length code units. 
 2. Encoding format is easily detectable for any content, even if mislabeled. 
 3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data. 
 4. If octets are dropped / inserted, decoder can resync at next valid code unit. 
 5. Practical for both internal processing and storage / interchange. 
 6. Conversion to code point scalar values is more trivial then for UTF-16 surrogate pairs 
    and UTF-7/8 multibyte sequences. 
 7. 7-bit transparent version can be easily derived. 
 8. Most compact for texts in archaic scripts. 
Disadvantages: 
 1. Takes more space then UTF-8/16, except for texts in archaic scripts. 
 2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values. 
 3. Incompatible with many legacy text-processing tools and protocols. "


And as David Starner pointed out in http://www.mail-archive.com/unicode@unicode.org/msg16011.html:

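Why? UTF-24 will almost invariably be larger than UTF-16, unless you are talking about a document in Old Italic or Gothic. The math alphanumeric characters will almost always be combined with enough ASCII to make UTF-8 a win, and if not, with enough BMP characters to make UTF-16 a win. Modern computers don't deal with 24-bit chunks well; in memory, they'd take up 32 bits apiece unless you declared them packed, and then they'd be a lot slower than UTF-16 or UTF-32. And if you're storing to disk, you may as well use BOCU or SCSU (you're already going non-standard), or use standard compression with UTF-8, UTF-16, BOCU or SCSU. Compressed SCSU or BOCU should take up half the space of UTF-24, if that.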
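The size argument in this quote is easy to check. The sketch below is illustrative only: it assumes a hypothetical UTF-24 that spends exactly 3 bytes per code point, and compares it with the sizes Python reports for the real encodings (without a BOM). The helper name `encoded_sizes` and the sample strings are made up for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of `text` in several encodings, without a BOM."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        # Hypothetical encoding: 3 bytes per code point.
        # In Python 3, len(str) counts code points.
        "UTF-24": 3 * len(text),
        "UTF-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII": "hello",
    "BMP (Japanese)": "\u65e5\u672c\u8a9e",
    # Three Gothic letters, all outside the BMP (U+10332, U+1033F, U+10344).
    "Gothic": "\U00010332\U0001033F\U00010344",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Only the Gothic sample, made up entirely of supplementary-plane characters, comes out smallest in the 3-byte form (9 bytes versus 12 for each of the others). For the ASCII and BMP samples, UTF-8 or UTF-16 is smaller, which matches the point made in the quote.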

You can also check the following StackOverflow post:

Why does UTF-32 exist when only 21 bits are needed for every character?

The second quote is actually from several years earlier: it dates from 2003 and was a reply to a proposal of mine. (2 upvotes)
