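I'm looking for a code snippet in plain old C that detects whether a given string is UTF-8 encoded. I know the regex-based solution, but for various reasons it would be better to avoid anything other than plain C in this particular case.
The regex-based solution looks like this (warning: various checks omitted):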
#define UTF8_DETECT_REGEXP "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"
pcre *utf8_re;
pcre_extra *utf8_pe;
const char *error;
int error_off;
int rc;
int vect[100];

/* str and len are the subject string and its length in bytes */
utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);
rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));
if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n");
}
Chr*_*oph 44
Here's a (hopefully bug-free) implementation of this expression in plain C:
_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
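Please note that this is a faithful translation of the regular expression recommended by W3C for form validation, which does indeed reject some valid UTF-8 sequences (in particular those containing ASCII control characters).
Also, even after fixing this by making the change mentioned in the comment, it still assumes zero termination, which prevents embedding NUL characters, although that should technically be legal.
When I dabbled in creating my own string library, I went with modified UTF-8 (i.e. encoding NUL as an overlong two-byte sequence). Feel free to use this header as a template for a validation routine that does not suffer from the above shortcomings.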
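Both limitations noted above (rejecting control characters and stopping at the first NUL) go away if the validator takes an explicit length instead of relying on zero termination. The following is only a minimal sketch under that assumption: the names is_utf8_len and in_range and the signature are made up for illustration and are not taken from the header mentioned above. It accepts every byte in 0x00..0x7F and otherwise applies the same byte ranges as the function above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return lo <= b && b <= hi;
}

/* Hypothetical length-based validator (name and signature are assumptions).
 * Accepts all ASCII bytes, including NUL and other control characters. */
bool is_utf8_len(const void *data, size_t len)
{
    const unsigned char *s = data;
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t n;                            /* number of continuation bytes */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */

        if (b <= 0x7F) {
            i++;
            continue;
        } else if (in_range(b, 0xC2, 0xDF)) {
            n = 1;
        } else if (in_range(b, 0xE0, 0xEF)) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;        /* exclude overlongs */
            if (b == 0xED) hi = 0x9F;        /* exclude surrogates */
        } else if (in_range(b, 0xF0, 0xF4)) {
            n = 3;
            if (b == 0xF0) lo = 0x90;        /* exclude overlongs */
            if (b == 0xF4) hi = 0x8F;        /* nothing above U+10FFFF */
        } else {
            return false;                    /* 0x80..0xC1, 0xF5..0xFF: not a lead byte */
        }

        if (len - i <= n)                    /* sequence cut off by end of buffer */
            return false;
        if (!in_range(s[i + 1], lo, hi))
            return false;
        for (size_t k = 2; k <= n; k++)
            if (!in_range(s[i + k], 0x80, 0xBF))
                return false;
        i += n + 1;
    }
    return true;
}
```

Because the length is passed in, a call such as is_utf8_len("a\0b", 3) treats the embedded NUL as an ordinary ASCII byte instead of as the end of the input.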
Joa*_*kim 31
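This decoder by Bjoern Hoehrmann is the simplest I've found. It also works by feeding it a single byte at a time while keeping a state. The state is very useful for parsing UTF8 that arrives in chunks over the network.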
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1
static const uint8_t utf8d[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};
static inline uint32_t
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
    uint32_t type = utf8d[byte];

    *codep = (*state != UTF8_ACCEPT) ?
        (byte & 0x3fu) | (*codep << 6) :
        (0xff >> type) & (byte);

    *state = utf8d[256 + *state*16 + type];
    return *state;
}
A simple validator/detector doesn't need the code point, so it could be written like this (the initial state is set to UTF8_ACCEPT):
#include <stddef.h>

uint32_t validate_utf8(uint32_t *state, const char *str, size_t len) {
    size_t i;
    uint32_t type;

    for (i = 0; i < len; i++) {
        // We don't care about the codepoint, so this is
        // a simplified version of the decode function.
        type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];

        if (*state == UTF8_REJECT)
            break;
    }

    return *state;
}
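If the text is valid UTF-8, UTF8_ACCEPT is returned. If it is invalid, UTF8_REJECT is returned. If more data is needed, some other integer is returned.
Usage example with data fed in chunks (e.g. from a network):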
char buf[128];
size_t bytes_read;
uint32_t state = UTF8_ACCEPT;

// Validate the UTF8 data in chunks.
while ((bytes_read = get_new_data(buf, sizeof(buf)))) {
    if (validate_utf8(&state, buf, bytes_read) == UTF8_REJECT) {
        fprintf(stderr, "Invalid UTF8 data!\n");
        return -1;
    }
}

// If everything went well we should have proper UTF8,
// the data might instead have ended in the middle of a UTF8
// codepoint.
if (state != UTF8_ACCEPT) {
    fprintf(stderr, "Invalid UTF8, incomplete codepoint\n");
}
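To make the three possible outcomes concrete, here is a self-contained sketch that repeats the table and the validation loop from above so that it compiles on its own. The wrapper is_complete_utf8 is a hypothetical convenience function added for illustration; it is not part of Hoehrmann's decoder.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

/* Transition table copied unchanged from the decoder quoted above. */
static const uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

/* Same loop as validate_utf8 above: advances *state over len bytes. */
uint32_t validate_utf8(uint32_t *state, const char *str, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint32_t type = utf8d[(uint8_t)str[i]];
        *state = utf8d[256 + (*state) * 16 + type];
        if (*state == UTF8_REJECT)
            break;
    }
    return *state;
}

/* Hypothetical wrapper (an assumption, not from the original decoder):
 * returns 1 only if the whole buffer is valid UTF-8 and does not end
 * in the middle of a multi-byte sequence. */
int is_complete_utf8(const char *str, size_t len)
{
    uint32_t state = UTF8_ACCEPT;
    return validate_utf8(&state, str, len) == UTF8_ACCEPT;
}
```

Feeding the three bytes of U+20AC (E2 82 AC) as the two chunks "\xE2\x82" and "\xAC" leaves the state at an intermediate value after the first call and at UTF8_ACCEPT after the second, whereas the overlong sequence C0 80 is rejected on its first byte.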
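You cannot detect whether a given string (or byte sequence) is UTF-8 encoded text, because, for example, every series of UTF-8 octets is also a valid (if nonsensical) series of Latin-1 (or some other encoding) octets. However, not every series of valid Latin-1 octets is a valid UTF-8 series. So you can rule out strings that do not conform to the UTF-8 encoding schema: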
U+0000-U+007F      0xxxxxxx
U+0080-U+07FF      110yyyxx 10xxxxxx
U+0800-U+FFFF      1110yyyy 10yyyyxx 10xxxxxx
U+10000-U+10FFFF   11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
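One way to read the schema above is as an encoder: the bits of the code point are distributed over the x, y and z positions, and a validator looks for the same patterns in reverse. The sketch below illustrates this; utf8_encode is a hypothetical helper name, and the rejection of surrogates (U+D800 to U+DFFF) follows RFC 3629 rather than the table itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of cp into out and
 * returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 110yyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                      /* 1110yyyy 10yyyyxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond U+10FFFF */
}
```

For instance, U+20AC falls in the third row and comes out as the three bytes E2 82 AC.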
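You have to parse the string as UTF-8; see http://www.rfc-editor.org/rfc/rfc3629.txt. It is very simple. If the parsing fails, it is not UTF-8. There are several simple UTF-8 libraries around that can do this.
It could perhaps be simplified if you know that the string is either plain old ASCII or contains characters outside ASCII that are UTF-8 encoded. In that case you often don't need to care about the difference: UTF-8 was designed so that existing programs that could handle ASCII could, in most cases, handle UTF-8 transparently.
Keep in mind that ASCII is encoded in UTF-8 as itself, so ASCII is valid UTF-8.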
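If only the ASCII case matters, the check is a single pass over the bytes. A minimal sketch, with is_ascii as a hypothetical helper name: any buffer it accepts is also valid UTF-8, while a buffer it rejects still needs a full UTF-8 parse to decide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is in the ASCII range 0x00..0x7F. */
bool is_ascii(const void *data, size_t len)
{
    const unsigned char *s = data;

    for (size_t i = 0; i < len; i++)
        if (s[i] > 0x7F)
            return false;
    return true;
}
```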
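A C string can be anything. Is the problem you need to solve that you don't know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the other character encodings that make a programmer's life hard?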