How to handle an unknown encoding

Question

I am having some trouble with a Python script that needs to open files with different encodings.

I usually use this:

with open(path_to_file, 'r') as f:
    first_line = f.readline()


This works fine when the file is correctly encoded.

But sometimes it does not work. For example, with this file I get:

In [22]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    print(repr(a))
    ...:     
??Test for StackOverlow

'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'


I would like to search for something in these lines. Sadly, with this approach I cannot:

In [24]: "Test" in a
Out[24]: False


I found many questions here about the same kind of problem, but I could not manage to decode the file correctly...

Using codecs.open():

In [17]: with codecs.open(filename, 'r', "utf-8") as f:
    a = f.readline()
    print(a)
   ....:     
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
      1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2     a = f.readline()
      3     print(a)
      4 

/usr/lib/python2.7/codecs.pyc in readline(self, size)
    688     def readline(self, size=None):
    689 
--> 690         return self.reader.readline(size)
    691 
    692     def readlines(self, sizehint=None):

/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
    543         # If size is given, we call read() only once
    544         while True:
--> 545             data = self.read(readsize, firstline=True)
    546             if data:
    547                 # If we're at a "\r" read one extra character (which might

/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
    490             data = self.bytebuffer + newdata
    491             try:
--> 492                 newchars, decodedbytes = self.decode(data, self.errors)
    493             except UnicodeDecodeError, exc:
    494                 if firstline:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

With encode('utf-8'):

In [18]: with codecs.open(filename, 'r') as f:
    a = f.readline()
    print(a)
   ....:     a.encode('utf-8')
   ....:     print(a)
   ....:     
??Test for StackOverlow

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
      2     a = f.readline()
      3     print(a)
----> 4     a.encode('utf-8')
      5     print(a)
      6 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I found a way to change the file's encoding automatically using Vim:

system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)

But I would like to do this without using Vim...

Any help would be appreciated.

Answer 1

Jor*_*ley 6

This looks like utf-16-le (UTF-16 little-endian), but you are missing the final \x00.

>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s+"\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
>>>

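As a rough illustration of the point above, the sketch below (written for Python 3, whereas the session above is Python 2) reads the file as bytes, picks a codec from a leading byte order mark, and decodes everything before splitting into lines. Decoding first avoids the truncation: in UTF-16-LE a newline is the two bytes \n\x00, so a byte-oriented readline() stops right after \n and leaves the trailing \x00 behind. The names sniff_bom and read_first_line, and the UTF-8 default, are assumptions made for this example and are not part of the original answer.

```python
import codecs

# Checked in this order on purpose: the UTF-32-LE BOM (ff fe 00 00)
# starts with the UTF-16-LE BOM (ff fe), so UTF-32 must be tested first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, or `default` if none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default  # assumption: files without a BOM are treated as UTF-8


def read_first_line(path):
    """Decode the whole file first, then split it into lines."""
    with open(path, "rb") as f:
        raw = f.read()
    # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves.
    text = raw.decode(sniff_bom(raw))
    return text.splitlines()[0] if text else ""
```

With a file like the one in the question, "Test" in read_first_line(filename) should then be true, since the line is real text instead of bytes with interleaved NULs. A BOM only identifies a handful of Unicode encodings, so this does nothing for files in legacy 8-bit encodings.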

Answer 2

hol*_*web 5

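It looks like you need to detect the encoding of the input file. The chardet library mentioned in the answers to this question may help (though note that fully reliable encoding detection is not possible).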
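A minimal sketch of that idea, assuming Python 3 and that the third-party chardet package may or may not be installed. chardet.detect() returns a dictionary whose "encoding" entry is the guess (or None when it has no guess). The helper name guess_encoding and the UTF-8 fallback are assumptions for this example.

```python
try:
    import chardet  # third-party package; optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(raw, fallback="utf-8"):
    """Return chardet's best guess for `raw`, or `fallback` if unavailable."""
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Either chardet is not installed or it could not decide.
    return fallback
```

The result is only a guess, so it is worth keeping the decode step inside a try/except UnicodeDecodeError and deciding explicitly what to do when the guess turns out to be wrong.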

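You can then write the file out in a known encoding. When working with Unicode, remember that it must be encoded into a suitable byte stream before any communication outside the process. Decode on input, then encode on output.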
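The "decode on input, encode on output" rule could look roughly like the following in Python 3, as a standard-library stand-in for the Vim command in the question. The function name convert_to_utf8 is hypothetical, and the source encoding is assumed to be already known or detected.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Decode src_path with src_encoding and write the text out as UTF-8."""
    # newline="" disables newline translation, so the original line
    # endings (for example \r\n) are preserved on both read and write.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()  # decode on input
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)  # encode on output
```

For the file in the question, convert_to_utf8(filename, new_filename, "utf-16") would be a plausible call: the "utf-16" codec reads the byte order mark to pick the byte order and does not carry it over into the output.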

This actually addresses the problem the OP asked about. +1 (2 upvotes)

Archived:	9 years, 6 months ago
Viewed:	1311 times
Last activity:	9 years, 6 months ago