Preventing overlong forms when parsing UTF-8

Question

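As a personal exercise I've been working on yet another UTF-8 parser. My implementation works quite well and rejects most malformed sequences (replacing them with U+FFFD), but I can't figure out how to reject overlong forms. Could anyone tell me how to do that?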

Pseudocode:

let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))
for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string

Answer 1

xan*_*tos 5

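If you save the number of bytes you need (that is, you keep a second copy of the initial value of w), you can compare the UTF-32 value of the code point (I think you're calling it c) against the number of bytes that were used to encode it. A sequence is overlong when its decoded value falls below the start of the range for its length. You know that: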

U+0000 - U+007F 1 byte
U+0080 - U+07FF 2 bytes
U+0800 - U+FFFF 3 bytes
U+10000 - U+1FFFFF 4 bytes
U+200000 - U+3FFFFFF 5 bytes
U+4000000 - U+7FFFFFFF 6 bytes

(I hope I did the math right in the left column! Hex math isn't my strong suit :-))
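
The comparison described above could be sketched roughly as follows in Python. The names MIN_CODE_POINT and is_overlong are made up for this illustration and do not appear in the question; total_bytes counts the lead byte plus its continuation bytes, so it corresponds to the saved initial value of w plus one.

```python
# Smallest code point that legitimately needs a sequence of the given total
# length (lead byte + continuation bytes), taken from the table above.
# MIN_CODE_POINT and is_overlong are hypothetical names for this sketch.
MIN_CODE_POINT = {
    1: 0x00,
    2: 0x80,
    3: 0x800,
    4: 0x10000,
    5: 0x200000,
    6: 0x4000000,
}


def is_overlong(code_point: int, total_bytes: int) -> bool:
    """Return True if code_point would have fit in a shorter sequence."""
    return code_point < MIN_CODE_POINT[total_bytes]
```

For example, the two-byte sequence C0 AF decodes to 0x2F, which is below 0x80, so is_overlong(0x2F, 2) reports it as overlong and the parser can emit U+FFFD instead.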

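As a side note: I think there are some logic errors/formatting errors. In the branch "if b < 0x80" followed by "if w > 0", what happens if w = 0 (for example, if you are decoding A)? And shouldn't you reset c when you find an illegal code point?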
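
Putting the pieces together, here is one possible Python rendering of the question's loop with the overlong check added. It is only a sketch under a few assumptions that go beyond the original pseudocode: the names decode, is_valid, pending, expected and MIN_FOR_CONTINUATIONS are invented here, a completed but invalid or overlong sequence is replaced with U+FFFD (the question's code silently drops it), and the bytes 0xFE and 0xFF also abandon any sequence in progress.

```python
REPLACEMENT = "\uFFFD"

# Smallest code point allowed for 0..5 continuation bytes (hypothetical name).
MIN_FOR_CONTINUATIONS = (0x00, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)


def is_valid(c: int) -> bool:
    # Same checks as valid(c) in the question: range, surrogates, noncharacters.
    return (c < 0x110000
            and (c & 0xFFFFF800) != 0xD800
            and not (0xFDD0 <= c <= 0xFDEF)
            and (c & 0xFFFE) != 0xFFFE)


def decode(data: bytes) -> str:
    out = []
    pending = 0   # continuation bytes still expected (w in the question)
    expected = 0  # saved copy of the initial value of pending
    c = 0         # code point under construction
    for b in data:
        if b < 0x80:
            if pending > 0:  # multi-byte sequence cut short
                out.append(REPLACEMENT)
                pending = 0
            out.append(chr(b))
        elif b < 0xC0:
            if pending == 0:  # stray continuation byte
                out.append(REPLACEMENT)
            else:
                pending -= 1
                c |= (b & 0x3F) << (pending * 6)
                if pending == 0:
                    # Overlong if c is below the minimum for this length.
                    if c >= MIN_FOR_CONTINUATIONS[expected] and is_valid(c):
                        out.append(chr(c))
                    else:
                        out.append(REPLACEMENT)  # assumption: replace, not drop
        elif b < 0xFE:
            if pending > 0:  # previous sequence cut short by a new lead byte
                out.append(REPLACEMENT)
            pending = (1 if b < 0xE0 else 2 if b < 0xF0 else
                       3 if b < 0xF8 else 4 if b < 0xFC else 5)
            expected = pending
            # Starting a new sequence overwrites c, so no stale bits survive.
            c = (b & ((1 << (6 - pending)) - 1)) << (pending * 6)
        else:  # 0xFE and 0xFF never appear in UTF-8
            if pending > 0:
                out.append(REPLACEMENT)
                pending = 0  # assumption: abandon the sequence in progress
            out.append(REPLACEMENT)
    if pending > 0:  # stream ended while waiting for continuation bytes
        out.append(REPLACEMENT)
    return "".join(out)
```

With this sketch, decode(b"A") returns "A" because the w = 0 case simply falls through to the append, and decode(b"\xC0\xAF") returns a single U+FFFD because the decoded value 0x2F is too small for a two-byte sequence.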
