Unbaking mojibake

Question

wim*_*wim 4 python unicode decoding character-encoding mojibake

When characters have been decoded incorrectly, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

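I know this image filename should have been some Japanese characters. But despite various guesses at urllib quoting/unquoting and at encoding and decoding with iso8859-1 and utf8, I have not been able to unmunge it and recover the original filename.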

Is the corruption reversible?

Answer 1

gal*_*den 5

You can use chardet (install it with pip):

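# -*- coding: cp850 -*-
# Python 2 only: your_str is a byte string, so its bytes depend on the
# encoding the source file is saved in (it must really be saved as cp850).
import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")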

Result: ???????????????_10? (I don't know whether this is correct)

For Python 3 (with the source file encoded as utf8):

import chardet

# The mojibake text as it appears on screen.
falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

try:
    # Undo the wrong decode to get the original bytes back.
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    # chardet reports None as the encoding when it cannot make a guess.
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (UnicodeDecodeError, LookupError, TypeError):
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # Only write the file when decoding succeeded.
        with open("output.txt", "w", encoding="utf-8-sig") as out:
            out.write(correct_str)

In short:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'?????????????_10?.png'
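When the two encodings are not known in advance, one option is to try a small set of plausible pairs and look at whatever decodes cleanly. The following is only a sketch: the codec lists and the `candidates` helper are assumptions made for illustration, not part of the answer above, and a clean decode does not prove a pair is right, so the results still need to be judged by eye.

```python
# Hypothetical shortlists; extend them to suit the language and platform involved.
WRONG_DECODERS = ["cp850", "cp437", "cp1252", "latin-1"]
TRUE_ENCODINGS = ["cp932", "euc_jp", "utf-8"]


def candidates(mojibake):
    """Map each distinct repaired text to the (wrong, true) codec pairs producing it."""
    found = {}
    for wrong in WRONG_DECODERS:
        try:
            # Reverse the suspected wrong decode to recover raw bytes.
            raw = mojibake.encode(wrong)
        except UnicodeEncodeError:
            continue  # the text contains characters this codec cannot produce
        for true in TRUE_ENCODINGS:
            try:
                text = raw.decode(true)
            except UnicodeDecodeError:
                continue  # the bytes are not valid in this encoding
            found.setdefault(text, []).append((wrong, true))
    return found


# Sample input built for illustration: katakana encoded as cp932, misread as cp850.
sample = "アニメパス".encode("cp932").decode("cp850")
results = candidates(sample)
```

Iterating over `results.items()` gives each candidate text together with the codec pairs that led to it. Several pairs can yield valid but unrelated text, which is why a reader who knows the target language has the final say.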

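Since the above contains a byte string with non-ASCII characters, what happens depends on the encoding in which you save the source file. Because the string `Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb` is the result of a code page 932 (Shift-JIS-like) encoded string being misinterpreted as code page 850 (DOS Western European), the source code above must be saved as cp850 to work. (4 upvotes)
Google Translate says "Test time angle (animation path) _10 seconds", which looks plausible enough! (2 upvotes)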
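The code page 932 versus code page 850 mix-up described in the comments can be reproduced in both directions with a known string. This is a minimal sketch; the sample text is chosen for illustration and is not taken from the thread. The repair is the wrong decode undone in reverse order: encode with the codec that was wrongly used, then decode with the one that should have been used.

```python
# Sample text chosen for illustration (an assumption, not the original filename).
original = "アニメパス"

raw = original.encode("cp932")    # the bytes as they were originally written
mojibake = raw.decode("cp850")    # the wrong decode that produces the garbage

# Undo the damage: back to bytes via cp850, then decode as cp932.
repaired = mojibake.encode("cp850").decode("cp932")

print(repaired == original)
```

The round trip is lossless here because the wrongly applied codec maps each of these bytes to a distinct character. If the wrong codec had replaced or dropped bytes it could not map, those characters would be gone for good.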

Asked:	11 years, 5 months ago
Viewed:	2,920 times
Last active:	7 years, 1 month ago