How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Question

Den*_*aia 38 python mysql django unicode

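I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters through utf8mb4; and, someday in the future, utf8 might support them as well.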

But my server is not ready to upgrade to MySQL 5.5, so I'm limited to UTF-8 characters that take 3 bytes or less.
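To make the limitation concrete, here is a minimal sketch that prints how many bytes a few sample characters take in UTF-8. It is written in Python 3 syntax, which is an assumption (the code in this thread targets Python 2), and the helper name `utf8_len` is hypothetical.

```python
def utf8_len(ch):
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Code points on both sides of the U+FFFF boundary.
samples = ["A", "\u00e9", "\u20ac", "\uffff", "\U00010000", "\U0001F600"]
for ch in samples:
    print("U+%04X -> %d byte(s)" % (ord(ch), utf8_len(ch)))
```

Everything up to U+FFFF fits in 3 bytes or less; code points from U+10000 upward need 4 bytes, and those are the ones the old MySQL utf8 charset cannot store.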

My question is: how do I filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

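In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.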
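For reference, this is the kind of behavior being described: with the 'replace' error handler, encode() substitutes ? for anything the target codec cannot represent, but it returns bytes rather than text. A small illustration in Python 3 syntax (an assumption), using ASCII as the target codec only because it makes the effect easy to see:

```python
text = "price: 5\u20ac"                      # contains U+20AC EURO SIGN
encoded = text.encode("ascii", "replace")    # unencodable characters become b"?"
print(encoded)                               # a bytes object, no longer text
```

Note that the utf-8 codec encodes 4-byte characters without raising an error, so this error handler never triggers for them; that is why a separate filtering step is needed.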

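I DON'T want to escape the characters before storing them in MySQL, because that would mean I would need to unescape every string I get back from the database, which is very annoying and unfeasible.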

See also:

"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
'' is not a valid unicode character, but in the unicode character set? (at Stack Overflow)

[EDIT] Added tests of the proposed solutions

So I have got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick test to find the simplest and fastest one.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

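The results:

filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds in the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)

I did not test the itertools solution because... well... although interesting, it was quite big and complex.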

Conclusion

The RegEx solution was, by far, the fastest one.
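The same comparison can be sketched for Python 3 with timeit instead of cProfile. This is an adaptation under stated assumptions, not the script that produced the numbers above: Python 3 syntax, a fixed random seed, and a run count of 20 are all choices made here for illustration, and the timings will differ from machine to machine.

```python
import random
import re
import timeit

random.seed(0)  # fixed seed so repeated runs use the same test string

# Same character class as in the question: anything outside the ranges
# that fit in 3 UTF-8 bytes (this also catches lone surrogates).
RE_PATTERN = re.compile("[^\u0000-\uD7FF\uE000-\uFFFF]")

def filter_using_re(text):
    return RE_PATTERN.sub("\uFFFD", text)

def filter_using_python(text):
    return "".join(
        ch if ch < "\ud800" or "\ue000" <= ch <= "\uffff" else "\ufffd"
        for ch in text
    )

# Roughly 10% of the characters are drawn from the full Unicode range.
test_string = "".join(
    chr(random.randrange(32, 0x110000 if random.randrange(100) >= 90 else 0x1000))
    for _ in range(8 * 1024)
)

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(test_string), number=20)
    print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```

Both functions should return identical strings for the same input, which is a useful sanity check before comparing their speed.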

Answer 1

dra*_*ard 34

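Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes (or less) in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogate pairs. I do not know Python, but you should be able to set up a regular expression that matches anything outside those ranges.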

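# Matches a UTF-16 surrogate code unit plus the character after it
pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
# Matches any character outside the Basic Multilingual Plane (above U+FFFF)
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)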

Edit: adding the Python code from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)

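As a usage illustration, the pattern can be wrapped in a small helper. The sketch below assumes Python 3 (where every str is unicode and no u prefix is needed); the function name `replace_4byte_chars` and its `replacement` parameter are hypothetical, not part of the original answer.

```python
import re

# Character class taken from the snippet above: everything that is NOT
# in U+0000-U+D7FF or U+E000-U+FFFF.
re_pattern = re.compile("[^\u0000-\uD7FF\uE000-\uFFFF]")

def replace_4byte_chars(text, replacement="\uFFFD"):
    """Return text with each character outside the 3-byte ranges replaced."""
    return re_pattern.sub(replacement, text)

sample = "snow \u2603 and grin \U0001F600"
print(replace_4byte_chars(sample).encode("unicode_escape"))
print(replace_4byte_chars(sample, "?").encode("unicode_escape"))
```

The result is still a text string, so it can be handed to the database layer as usual. Because the class excludes U+D800-U+DFFF, lone surrogates are replaced as well.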

Answer 2

小智 6

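You can skip the decoding and encoding steps and directly inspect the value of the first byte of each character in the 8-bit string. According to UTF-8: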

#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

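The bit patterns listed above translate directly into mask tests on the lead byte. A minimal sketch, assuming Python 3 (where indexing a bytes object yields an int); the function name `utf8_sequence_length` is hypothetical:

```python
def utf8_sequence_length(lead_byte):
    """Return the length of the UTF-8 sequence that starts with lead_byte."""
    if (lead_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:   # 110xxxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:   # 1110xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    data = ch.encode("utf-8")
    print("lead byte 0x%02X -> %d byte(s)" % (data[0], utf8_sequence_length(data[0])))
```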

Given this, you only need to check the value of the first byte of each character to filter out the 4-byte characters:

def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)

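The function above works on a Python 2 byte string. A rough Python 3 counterpart operating on a bytes object might look like the sketch below. The name `filter_4byte_sequences` and the optional `replacement` argument are hypothetical additions, and the code assumes its input is already valid UTF-8, as the original does.

```python
def filter_4byte_sequences(data, replacement=b""):
    """Drop (or replace) every 4-byte UTF-8 sequence in a bytes object.

    Assumes data is valid UTF-8; malformed input is not detected.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:        # 1-byte sequence
            size = 1
        elif lead < 0xE0:      # 2-byte sequence
            size = 2
        elif lead < 0xF0:      # 3-byte sequence
            size = 3
        else:                  # 4-byte sequence: emit the replacement instead
            out += replacement
            i += 4
            continue
        out += data[i:i + size]
        i += size
    return bytes(out)

raw = "a\U0001F600b\u20ac".encode("utf-8")
print(filter_4byte_sequences(raw))
print(filter_4byte_sequences(raw, "\ufffd".encode("utf-8")))
```

Appending to a bytearray avoids deleting slices from the middle of a list, which has to shift the remaining elements each time. Passing the UTF-8 encoding of U+FFFD as `replacement` gives the substitution behavior asked for in the question rather than plain removal.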

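Skipping the decoding and encoding steps will save you some time, and for smaller strings made up mostly of 1-byte characters this could even be faster than regular expression filtering.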

Archived:	15 years, 7 months ago
Viewed:	25107 times
Last activity:	11 years, 7 months ago