How do I decode non-Unicode characters in Python?

Question

I have a string, say s = 'Chocolate Moelleux-M\xe8re'. When I do:

In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

Similarly, when I try to decode it with s.decode(), I get the same error:

In [13]: s.decode()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

How can I decode such a string to Unicode?

Answer 1

Sri*_*aju 10

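I have had to face this problem many times. My problem was that I had strings in several different encoding schemes, so I wrote a method that decodes a string heuristically, based on certain characteristics of the different encodings.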

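import re
import sys

def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings (Python 2).
    @input : string, optional encoding to try first, default encoding.
    @output: a tuple (decoded_string, flag_decoded, encoding);
             flag_decoded is 1 only when undecodable bytes were dropped.
    """
    if isinstance(string, unicode):
        return string, 0, "utf-8"
    try:
        # Pure ASCII decodes the same way in every candidate encoding.
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]

        # Try the interpreter default and the caller's hint first.
        if denc != "ascii":
            encodings.insert(0, denc)
        if enc:
            encodings.insert(0, enc)

        for candidate in encodings:
            # Bytes 0x80-0x9f are control codes in the ISO-8859 sets, so
            # real text containing them is unlikely to be ISO-8859-1/-15.
            if (candidate in ("iso-8859-15", "iso-8859-1") and
                    re.search(r"[\x80-\x9f]", string) is not None):
                continue

            # These bytes map to different characters in ISO-8859-15, so
            # leave such strings for the iso-8859-15 attempt.
            if (candidate in ("iso-8859-1", "cp1252") and
                    re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)
                    is not None):
                continue

            try:
                new_string = unicode(string, candidate)
            except UnicodeError:
                pass
            else:
                # Accept only a decoding that round-trips to the same bytes.
                if new_string.encode(candidate) == string:
                    return new_string, 0, candidate

        # Nothing decoded cleanly: force decoding by ignoring bad bytes,
        # then keep the result that preserved the most characters.
        output = [(unicode(string, e, "ignore"), e) for e in encodings]
        output = [(len(item[0]), item) for item in output]
        output.sort()
        new_string, candidate = output[-1][1]
        return new_string, 1, candidate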

Also, here is a link with good feedback on encodings and related topics: Why we need sys.setdefaultencoding in py script
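
For the specific string in the question, the error comes from decoding with the default 'ascii' codec, so naming an encoding explicitly is usually enough. The question does not say which encoding produced the bytes. The sketch below assumes a Western European single-byte encoding (cp1252 or iso-8859-1), where byte 0xe8 is "è". It is written for Python 3, where the raw data is a bytes object; in Python 2 the equivalent call would be s.decode("iso-8859-1"). The helper name decode_with_candidates and its candidate list are hypothetical, not part of the answer above.

```python
# Python 3 sketch. Assumption: the bytes are cp1252/iso-8859-1 text.
raw = b"Chocolate Moelleux-M\xe8re"  # the byte string from the question


def decode_with_candidates(data, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached when the candidate list has no catch-all such as
    # iso-8859-1: substitute U+FFFD for bytes that cannot be decoded.
    return data.decode("utf-8", errors="replace"), "utf-8"


text, used = decode_with_candidates(raw)
print(text, used)
```

Trying utf-8 first is a deliberate choice: it rejects most non-UTF-8 byte sequences, whereas iso-8859-1 accepts any byte and so only makes sense as the last candidate.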
