How to determine the encoding of text?

Question

How to determine the encoding of text?

Nop*_*ope 204 python encoding text-files

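I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? The related question "How can I detect the encoding/codepage of a text file" deals with C#.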

Answer 1

Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

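However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that it isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.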

The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
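Why a detector can only guess is easy to see with the standard library alone. The sketch below is an illustration, not part of chardet: the sample bytes are made up, and the three candidate encodings are an arbitrary choice. The same four bytes are perfectly valid in two single-byte encodings (yielding different text) and invalid in UTF-8, and nothing in the bytes themselves says which reading was intended.

```python
# The same bytes, interpreted under three candidate encodings.
raw = b"caf\xe9"

decodings = {}
for name in ("latin-1", "cp1251", "utf-8"):
    try:
        decodings[name] = raw.decode(name)
    except UnicodeDecodeError:
        decodings[name] = None  # these bytes are not valid in this encoding

for name, text in decodings.items():
    print(name, repr(text))
```

Only knowledge about the expected language (French rather than Russian, say) lets a human or a statistical detector prefer one of the valid readings.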

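Alternatively you can use UnicodeDammit. It will try the following methods, in this order:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252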
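The general shape of such a fallback chain can be sketched with the standard library. This is not Beautiful Soup's implementation: `guess_and_decode` and its `declared` parameter are hypothetical names, `declared` merely stands in for "an encoding found in the document", the chardet step is left out, and the final latin-1 fallback is an addition so that the function always returns something.

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_and_decode(data, declared=None):
    """Return (text, encoding_name) using a fixed fallback order (hypothetical helper)."""
    candidates = []
    if declared:
        candidates.append(declared)        # 1. encoding declared by the caller/document
    for bom, name in _BOMS:
        if data.startswith(bom):
            candidates.append(name)        # 2. encoding sniffed from the first bytes
            break
    candidates += ["utf-8", "windows-1252"]  # 3. common defaults

    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue                       # invalid bytes or unknown codec name
    # windows-1252 leaves a few byte values undefined; latin-1 accepts every byte
    return data.decode("latin-1"), "latin-1"
```

The order matters: a permissive single-byte encoding such as Windows-1252 accepts almost any input, so it has to come after the stricter UTF-8 check or it would win every time.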

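@Geomorillo: There's no such thing as "the encoding standard". Text encoding is as old as computing; it grew organically with time and needs, and it wasn't planned. "Unicode" is an attempt to fix this. (13 upvotes)
@dumbledad What I said is that correctly detecting it **all the time** is impossible. All you can do is guess, and the guess can sometimes fail; it won't work every time, because encodings are not really detectable. To make the guess, you can use one of the tools I suggested in the answer. (2 upvotes)
@LasseKärkkäinen The point of this answer is to show that correctly detecting the encoding is **impossible**; the function you provide can guess right for your case, but is wrong in many cases. (2 upvotes)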

Answer 2

Ham*_*ner 57

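Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.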

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:

import magic

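# Read raw bytes: text mode would try to decode the data before libmagic sees it
with open('unknown-file', 'rb') as f:
    blob = f.read()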
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc


There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:

import magic

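# Read raw bytes: text mode would try to decode the data before libmagic sees it
with open('unknown-file', 'rb') as f:
    blob = f.read()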
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)


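`libmagic` is indeed a viable alternative to `chardet`. And great info on the distinct packages named `python-magic`! I'm sure this ambiguity bites many people. (5 upvotes)
`sudo apt-get install python3-magic` for python3 (5 upvotes)
`file` is not particularly good at identifying human language in text files. It is excellent for identifying various container formats, though you sometimes have to know what it means ("Microsoft Office document" could mean an Outlook message, etc.). (2 upvotes)
@xtian You need to open in binary mode, i.e. open("filename.txt", "rb"). (2 upvotes)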

Answer 3

zza*_*art 28

Some encoding strategies; uncomment the lines you need:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file ........'
# Quote the variable so paths containing spaces still work
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"


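You might like to check the encoding by opening and reading the file in a loop over candidate encodings... but you might need to check the file size first: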

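import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # ...etc
for e in encodings:
    try:
        # codecs.open decodes while reading, so readlines() raises
        # UnicodeDecodeError on the first byte sequence invalid in encoding e
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)  # rewind so fh can be read again with the working encoding
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break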

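If the file is large, the same trial decoding can be done in fixed-size chunks instead of loading every line into memory. The following is a minimal sketch using the standard library's incremental decoders; `decodes_cleanly` and `chunk_size` are hypothetical names introduced for illustration.

```python
import codecs


def decodes_cleanly(path, encoding, chunk_size=64 * 1024):
    """Return True if the whole file at `path` is valid in `encoding`."""
    # An incremental decoder keeps state between calls, so a multibyte
    # character split across two chunks is still decoded correctly.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
            # final=True makes a truncated trailing sequence raise as well
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Passing this check only means the bytes are valid in that encoding, not that it is the right one: a permissive single-byte encoding will accept nearly any file, so stricter candidates such as UTF-8 should be tried first.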

Answer 4

rya*_*lon 20

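Here is an example of reading a file and taking chardet's encoding prediction at face value; it reads only n_lines from the file in case the file is large.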

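chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned together with the prediction by chardet.detect(), so you could work that in somehow if you like.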

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']

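One way to work the confidence value in is to reject low-confidence guesses. This is only a sketch: the function name, the `n_bytes` and `min_confidence` parameters, and the 0.8 threshold are arbitrary assumptions, and the `detect` parameter exists only so the function can be exercised without chardet installed (by default it uses `chardet.detect`).

```python
def predict_encoding_checked(file_path, n_bytes=4096, min_confidence=0.8, detect=None):
    """Return the detected encoding, or None if the detector is not confident enough."""
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    # Open the file as binary data and sample only the first n_bytes
    with open(file_path, 'rb') as f:
        rawdata = f.read(n_bytes)

    # chardet.detect returns a dict with 'encoding' and 'confidence' keys
    result = detect(rawdata)
    if result['encoding'] is None or result['confidence'] < min_confidence:
        return None
    return result['encoding']
```

A caller that gets None back can then fall back to a default encoding or ask the user, rather than silently decoding with a poor guess.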

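I have modified this function this way: `def predict_encoding(file_path, n=20): ... skip ...` and then `rawdata = b''.join([f.read() for _ in range(n)])`. I have tried this function on Python 3.6, and it worked perfectly with "ascii", "cp1252", "utf-8" and "unicode" encodings. So this is definitely right. (2 upvotes)
This is very good for handling small datasets with a variety of formats. I tested this recursively on my root dir and it worked like a treat. Thanks buddy. (2 upvotes)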

Answer 5

小智 9

This might be helpful:

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'


Thanks, I can now determine the correct encoding! (2 upvotes)

Answer 6

小智 9

If you are not satisfied with the automatic tools, you can try all codecs and see manually which codec is right.


all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437',
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125',
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13',
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7',
'utf_8', 'utf_8_sig']

def find_codec(text):
    # Try every (encode, decode) pair and print the ones that succeed
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except Exception:
                # Skip pairs that cannot represent the text or are unavailable;
                # unlike a bare except, Ctrl+C still interrupts the loop
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, Ã¶")


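This script tries 97 × 97 = 9409 encode/decode pairs and prints one line for every pair that succeeds, so it can produce up to 9409 lines of output. If the output does not fit on your terminal screen, try writing it to a text file.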
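What a single winning pair looks like can be shown in isolation. In this small sketch the sample word is made up for illustration: the text was originally UTF-8 but was decoded as cp1252 somewhere along the way, and reversing that wrong step (encode with cp1252, decode with UTF-8) restores it.

```python
# "Schön" encoded as UTF-8 and then wrongly decoded as cp1252 looks like this:
garbled = "SchÃ¶n"

# Undo the wrong decode, then apply the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

This is the pair you would look for in the output of the script above: the line whose result reads as sensible text tells you which encoding was misapplied (the encode side) and which one was intended (the decode side).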


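As a complement to this good answer, I wrote a [Python script](https://gist.github.com/FilipDominec/912b18147842ed5de7adbf3fab1413c9) that prints a nice table of possible encoding mismatches, provided the user can guess a small sample of the original, correct string. It then suggests which encode/decode pair eliminates the mismatch. (3 upvotes)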

Answer 7

Bil*_*ore 6

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    with open(file, 'rb') as f:
        b = f.read(4)

    # "utf-8-sig" strips the BOM on read; plain "utf8" would leave it in the text
    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf-8-sig"

    # Check utf-32 before utf-16: the utf-32 little-endian BOM
    # (ff fe 00 00) starts with the utf-16 little-endian BOM (ff fe)
    if ((b[0:4] == b'\xff\xfe\x00\x00')
              or (b[0:4] == b'\x00\x00\xfe\xff')):
        return "utf32"

    # Python automatically detects endianess if utf-16 bom is present
    # write endianess generally determined by endianess of CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

# "utf-8-sig" writes the BOM that bomType() looks for; plain "utf8" does not
fout = open("myfile2.txt", "w", encoding="utf-8-sig")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
print(fin.read(20))
fin.close()

Run Code Online (Sandbox Code Playgroud)

Asked:	17 years ago
Viewed:	198986 times
Last activity:	6 years, 9 months ago