如何检测普通C中的UTF-8?

Kon*_*tin 33 c utf-8

我正在寻找普通旧C中的代码片段,它检测到给定的字符串是UTF-8编码.我知道正则表达式的解决方案,但由于各种原因,最好避免在这种特殊情况下使用除了普通C之外的任何东西.

正则表达式的解决方案如下所示(警告:省略了各种检查):

#define UTF8_DETECT_REGEXP  "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

const char *error;
int         error_off;
int         rc;
int         vect[100];

utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);

rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n")
}
Run Code Online (Sandbox Code Playgroud)

Chr*_*oph 44

这是在纯C 中这个表达式的一个(希望无错)实现:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
Run Code Online (Sandbox Code Playgroud)

请注意,这是对W3C推荐的用于表单验证的正则表达式的忠实翻译,它确实拒绝了一些有效的UTF-8序列(特别是那些包含ASCII控制字符的序列).

此外,即使在通过进行注释中提到的更改来解决此问题之后,它仍然假定为零终止,这可以防止嵌入NUL字符,尽管它在技术上应该是合法的.

当我涉足创建自己的字符串库时,我选择了修改后的UTF-8(即将NUL编码为超长的双字节序列) - 随意使用此标头作为模板提供不受影响的验证例程以上缺点.

  • @Lucas:不,它不能:`&&`会使这种情况短路,因为`0`不在任何有效的多字节序列范围内 (5认同)
  • 由于您正在读取字节+1,+ 2,+ 3并且仅检查该字节!= 0,因此该代码可以读取超过字符串的结尾.即使它是零终止. (4认同)
  • 这太棒了,非常感谢您提供此代码。 (2认同)
  • @DanW; Danny的`IsUTF8Bytes()`只验证字节字符串是否符合UTF-8编码方案(参见[Stefan的回答](http://stackoverflow.com/a/1031683/48015)),而我的做了一些额外的验证排除非字符和ASCII控制序列; 后者实际上是有效的UTF-8 - 见上面的评论 (2认同)

Joa*_*kim 31

Bjoern Hoermann的这个解码器是我发现的最简单的解码器.它也可以通过输入单个字节,以及保持状态来工作.该状态对于解析通过网络进入块的UTF8非常有用.

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

uint32_t inline
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  *state = utf8d[256 + *state*16 + type];
  return *state;
}
Run Code Online (Sandbox Code Playgroud)

一个简单的验证器/检测器不需要代码点,因此它可以这样写(初始状态设置为UTF8_ACCEPT):

uint32_t validate_utf8(uint32_t *state, char *str, size_t len) {
   size_t i;
   uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}
Run Code Online (Sandbox Code Playgroud)

如果文本有效,UTF8_ACCEPT则返回utf8 .如果它无效UTF8_REJECT.如果需要更多数据,则返回一些其他整数.

用于以块(例如来自网络)提供数据的用法示例:

char buf[128];
size_t bytes_read;
uint32_t state = UTF8_ACCEPT;

// Validate the UTF8 data in chunks.
while ((bytes_read = get_new_data(buf, sizeof(buf))) {
    if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT)) {
        fprintf(stderr, "Invalid UTF8 data!\n");
        return -1;
    }
}

// If everything went well we should have proper UTF8,
// the data might instead have ended in the middle of a UTF8
// codepoint.
if (state != UTF8_ACCEPT) {
    fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
}
Run Code Online (Sandbox Code Playgroud)


Ste*_*rig 9

不能检测一个给定字符串(或字节序列)是UTF-8编码的文本,例如每个系列UTF-8的八位位组也有效(如果无意义)系列的Latin-1的(或一些其它的编码)字节.然而,并非每个有效的Latin-1八位字节系列都是有效的UTF-8系列.因此,您可以排除不符合UTF-8编码模式的字符串:

U+0000-U+007F    0xxxxxxx
U+0080-U+07FF    110yyyxx    10xxxxxx
U+0800-U+FFFF    1110yyyy    10yyyyxx    10xxxxxx
U+10000-U+10FFFF 11110zzz    10zzyyyy    10yyyyxx    10xxxxxx   
Run Code Online (Sandbox Code Playgroud)

  • +1这都是猜测."这可能是utf8" (6认同)

nos*_*nos 6

您必须将字符串解析为UTF-8,请参阅http://www.rfc-editor.org/rfc/rfc3629.txt这非常简单.如果解析失败,则不是UTF-8.有几个简单的UTF-8库可以做到这一点.

如果你知道字符串是普通的旧ASCII 或者它包含UTF-8编码的ASCII之外的字符,那么它可能会被简化.在这种情况下,您通常不需要关心差异,UTF-8的设计是可以处理ASCII的现有程序,在大多数情况下可以透明地处理UTF-8.

请记住,ASCII是以UTF-8编码的,因此ASCII是有效的UTF-8.

AC字符串可以是任何东西,是您需要解决的问题,您不知道内容是ASCII,GB 2312,CP437,UTF-16,还是其他任何使程序生活变得困难的字符编码. ?