Python,UnicodeDecodeError

Question

I am getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

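I tried setting many different codecs (in the header, # -*- coding: utf8 -*-), and even using u"string", but the error still appears.

How do I fix it?

Edit: I don't know the actual character that is causing this, but since this is a program that explores folders recursively, it must have found a file with a strange character in its name.

Code: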

# -*- coding: utf8 -*-


# by TerabyteST

###########################

# Explores given path recursively
# and finds file which size is bigger than the set treshold

import sys
import os

class Explore():
    def __init__(self):
        self._filelist = []

    def exploreRec(self, folder, treshold):
        print folder
        generator = os.walk(folder + "/")
        try:
            content = generator.next()
        except:
            return
        folders = content[1]
        files = content[2]
        for n in folders:
            if "$" in n:
                folders.remove(n)
        for f in folders:
            self.exploreRec(u"%s/%s"%(folder, f), treshold)
        for f in files:
            try:
                rawsize = os.path.getsize(u"%s/%s"%(folder, f))
            except:
                print "Error reading file %s"%u"%s/%s"%(folder, f)
                continue
            mbsize = rawsize / (1024 * 1024.0)
            if mbsize >= treshold:
                print "File %s is %d MBs!"%(u"%s/%s"%(folder, f), mbsize)

Error:

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    a.exploreRec("C:", 100)
  File "D:/Python/Explorator/shitfinder.py", line 35, in exploreRec
    print "Error reading file %s"%u"%s/%s"%(folder, f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

Here is the output when using print repr("Error reading file %s"%u"%s/%s"%(folder.decode('utf-8','ignore'), f.decode('utf-8','ignore'))):

>>> a = Explore()
>>> a.exploreRec("C:", 100)
File C:/Program Files/Ableton/Live 8.0.4/Resources/DefaultPackages/Live8Library_v8.2.alp is 258 MBs!
File C:/Program Files/Adobe/Reader 9.0/Setup Files/{AC76BA86-7AD7-1040-7B44-A90000000001}/Data1.cab is 114 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/Art1.bar is 393 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art2.bar is 396 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art3.bar is 228 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/Sound/Sound.bar is 273 MBs!
File C:/ProgramData/Microsoft/Search/Data/Applications/Windows/Windows.edb is 162 MBs!
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    a.exploreRec("C:", 100)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 78: ordinal not in range(128)
>>>

Answer 1

Joh*_*hin 15

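We can't guess what you are trying to do, nor what is in your code, nor what "setting many different codecs" means, nor what u"string" was supposed to do for you.

Please change your code back to its initial state so that it reflects as closely as possible what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get (2) a snippet containing the last statement of your script that appears in the traceback (3) a brief description of what you want the code to do (4) the version of Python you are running.

Edit after details were added to the question:

(0) Let's try some transformations on the failing statement: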

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces to reduce illegibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasise the order of evaluation:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)

Is that really what you intended? Suggestion: construct the path ONCE, using a better method (see point (2) below).
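
As a rough illustration of why the transformed statement can fail at all: in Python 2, interpolating a byte string (str) into a unicode template makes Python decode the bytes with the default 'ascii' codec first. The sketch below spells that hidden step out explicitly, in Python 3 syntax. The sample bytes are an assumption (a cp1252-encoded Italian name resembling those in the question's output), not the actual file name that triggered the traceback.

```python
# Hypothetical file name as raw bytes; 0xe0 is "a with grave accent" in cp1252.
raw_name = b"Modalit\xe0 provvisoria.lnk"


def implicit_ascii_decode(data):
    """Mimic the hidden step Python 2 performs when str meets a unicode template."""
    return data.decode("ascii")


try:
    implicit_ascii_decode(raw_name)
    message = None
except UnicodeDecodeError as exc:
    # Same family of error as in the question's traceback.
    message = str(exc)

print(message)
```

The position reported in the message is simply the offset of the first non-ASCII byte in whatever byte string was being decoded.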

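(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way, your diagnostic code is much less likely to cause an exception (as is happening here) AND you avoid ambiguity. Insert the statement print repr(folder), repr(f) before you try to get the size, rerun, and report back.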

(2) Don't build paths with u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename)
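
Points (1) and (2) can be combined into one small helper. This is only a sketch in Python 3 syntax: the function name describe and the sample names are made up for illustration. In Python 3, ascii() (or the %a format) is the closest counterpart of Python 2's repr(), because it escapes non-ASCII characters instead of printing them.

```python
import os


def describe(folder, name):
    """Build the path once and return it with an unambiguous diagnostic string."""
    path = os.path.join(folder, name)
    # %a escapes non-ASCII characters, so printing the result cannot fail
    # on a console that is unable to display them.
    diagnostic = "folder=%a name=%a" % (folder, name)
    return path, diagnostic


path, diagnostic = describe("some_folder", "Modalit\xe0 provvisoria.lnk")
print(diagnostic)
```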

(3) Don't use bare excepts; check for known problems. So that unknown problems don't remain unknown, do something like this:

try:
    some_code()
except ReasonForBaleOutError:
    continue
except: 
    # something's gone wrong, so get diagnostic info
    print repr(interesting_datum_1), repr(interesting_datum_2)
    # ... and get traceback and error message
    raise

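A more sophisticated approach would involve logging instead of printing, but the above is much better than not knowing what is going on.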
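
A minimal sketch of the logging variant, in Python 3 syntax. The logger name "explore" and the helper safe_size are hypothetical; the only known problem handled here is OSError from os.path.getsize, and anything else is deliberately left to propagate.

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("explore")  # hypothetical logger name


def safe_size(path):
    """Return the size of path in bytes, or None if the file cannot be read."""
    try:
        return os.path.getsize(path)
    except OSError:
        # Known, expected problem: record the path (escaped) plus the
        # traceback, then let the caller carry on with the next file.
        log.exception("cannot get size of %a", path)
        return None


print(safe_size("definitely/not/here.bin"))
```

Unexpected exception types are not caught, so they still surface with a full traceback rather than being silently swallowed.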

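Further edits after rtfm("os.walk"), remembering old legends, and re-reading your code:

(4) os.walk() walks over the whole tree; you don't need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, file names) are reported as unicode. You don't need all that u"blah" stuff. Then you only have to choose how to display the unicode results.

(6) Removing paths with "$" in them: you must modify the list in place, but your method is dangerous (removing items from a list while iterating over it skips elements). Try something like this: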

# Iterate backwards so that deleting an item does not shift the ones still to be visited.
for i in xrange(len(folders) - 1, -1, -1):
    if '$' in folders[i]:
        del folders[i]
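
An alternative to the index loop, sketched in Python 3 syntax: slice assignment replaces the contents of the list while keeping the same list object, which is what os.walk() needs in order to honour the pruning. The function name and the sample folder names are illustrative assumptions. The second half reproduces the loop from the question to show how it misses an entry.

```python
def prune_in_place(folders, marker="$"):
    """Drop every name containing marker, keeping the same list object."""
    folders[:] = [name for name in folders if marker not in name]
    return folders


names = ["$Recycle.Bin", "$WinREAgent", "Program Files", "Users"]
same_list = prune_in_place(names)
print(names)

# The loop from the question, for comparison: removing while iterating
# skips the element that slides into the removed item's position.
buggy = ["$Recycle.Bin", "$WinREAgent", "Program Files", "Users"]
for n in buggy:
    if "$" in n:
        buggy.remove(n)
print(buggy)
```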

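(7) You refer to files by joining a folder name and a file name. You are using the ORIGINAL folder name; once you rip out the recursion, that won't work. You will need to use the content[0] value reported by os.walk, which you currently discard.

(8) You should find yourself using something very simple, like: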

for folder, subfolders, filenames in os.walk(unicoded_top_folder):

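without the generator = os.walk(...); try: content = generator.next() etc. stuff. And BTW, if you ever need to call generator.next() in the future, use except StopIteration instead of a bare except.

(9) If the caller provides a non-existent folder, no exception is raised; it just does nothing. If the supplied folder exists but is empty, ditto. If you need to distinguish between these two scenarios, you will need to do extra testing yourself.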
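
Points (2) and (4) through (9) put together might look roughly like the following. This is a sketch in Python 3 syntax, not the original poster's program: find_big_files and threshold_mb are hypothetical names, and raising ValueError for a missing folder is just one possible way of doing the extra test mentioned in point (9).

```python
import os


def find_big_files(top, threshold_mb):
    """Yield (path, size_in_mb) for files under top at or above threshold_mb."""
    if not os.path.isdir(top):
        # os.walk() silently yields nothing for a missing folder,
        # so check explicitly when the caller needs to know.
        raise ValueError("not a directory: %a" % (top,))
    for folder, subfolders, filenames in os.walk(top):
        # Prune in place so os.walk() does not descend into "$" folders.
        subfolders[:] = [d for d in subfolders if "$" not in d]
        for name in filenames:
            # Join with the folder currently being walked, not the top folder.
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            mbsize = size / (1024 * 1024.0)
            if mbsize >= threshold_mb:
                yield path, mbsize
```

A call such as for path, mb in find_big_files(some_folder, 100) would then replace the recursive exploreRec method; because the function is a generator, the folder check runs when iteration starts.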

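Response to this comment from the OP: """Thanks, please read the info repr() has shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. The common thing between all of them is that they are .ink files. Could that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)"""

(10) Umm, that's not ".INK".lower(), it's ".LNK".lower() ... perhaps you need to change the font in whatever you are reading it with.

(11) The fact that the "problem" file names all end in ".lnk" /may/ have something to do with os.walk() and/or Windows doing something special with the names of those files.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced: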

print repr(
    "Error reading file %s" \
    % u"%s/%s" % (
        folder.decode('utf-8','ignore'),
        f.decode('utf-8','ignore')
        )
    )

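You seem not to have read, or not to have understood, or just to have ignored, the advice I gave in a comment on another answer (and that answerer's reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.

We are interested in what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "( Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.

In any case, the fact that some of the file names had some kind of error but appeared NOT to lose characters with the "ignore" option (or appeared NOT to be mangled) is suspicious.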
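
The difference between the two error handlers can be seen in a short sketch, written in Python 3 syntax. The byte string is an assumption: the name Explorer shows, encoded as cp1252. Python 3's UTF-8 decoder discards only the invalid byte itself, so the exact characters lost can differ from the Python 2 output quoted in the question, where the two characters after the accented letter vanished as well.

```python
# Hypothetical raw bytes for the name shown by Explorer, cp1252-encoded.
raw = b"( Modalit\xe0 provvisoria)"

ignored = raw.decode("utf-8", "ignore")    # silently drops undecodable bytes
replaced = raw.decode("utf-8", "replace")  # leaves U+FFFD where data was lost
correct = raw.decode("cp1252")             # the encoding the bytes are really in

# ascii() escapes non-ASCII characters, so the damage is visible on any console.
print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```

With "ignore" there is no trace that anything went wrong; with "replace" the U+FFFD marker shows exactly where the wrong codec lost data.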

What part of """Insert the statement print repr(folder), repr(f)""" did you not understand? All you need to do is something like this:

print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)

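(13) It also appears that you have introduced UTF-8 in another place in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely to be str objects, so why introduce blahbah.encode()?

A more general point: try to understand what your problem is BEFORE changing your script. Thrashing about trying every suggestion, coupled with near-zero effective debugging technique, is not the way forward.

(14) When you run your script again, you might like to reduce the volume of output by running it over some subset of C:\ ... especially if you follow my original suggestion to debug-print ALL file names, not just the erroneous ones (knowing what the non-error ones look like could help in understanding the problem).

Response to Bryan McLemore's "clean up" function:

(15) Here is an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names: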

C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest

20/11/2009  01:28 PM    <DIR>          .
20/11/2009  01:28 PM    <DIR>          ..
20/11/2009  11:48 AM    <DIR>          empty
20/11/2009  01:26 PM                11 Hašek.txt
20/11/2009  01:31 PM             1,419 tbyte1.py
29/12/2007  09:33 AM                 9 Ð.txt
               3 File(s)          1,439 bytes
[snip]

C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os

os.walk(unicode_string) -> results are unicode objects

>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
  [u'empty'],
  [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results are str objects

>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
  ['empty'],
  ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I would expect to be used on my system ...

>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'

As expected, decoding the str with UTF-8 doesn't work

>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'

BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer

>>> import unicodedata
>>> unicodedata.name(u'\u0161')
'LATIN SMALL LETTER S WITH CARON'
>>>

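(16) That example shows what happens when the character is representable in the default character set; what happens if it is not? Here is an example (using IDLE this time) of a file name containing CJK ideographs, which definitely are not representable in my default character set: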

IDLE 2.6.4      
>>> import os
>>> from pprint import pprint as pp

repr(unicode results) looks fine

>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt

The str result is evidently produced by using .encode(whatever, "replace") -- not very useful, e.g. you can't open the file by passing that as the file name.

>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So the conclusion is that, for best results, one should pass a unicode string to os.walk(), and deal with any display problems.
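
For readers on Python 3, where text strings are unicode by default, the same rule (the type of the path passed in decides the type of the names that come out) can be checked with a small sketch. The helper walk_types and the sample file name are assumptions made for illustration; os.fsencode() is used only to obtain a bytes version of the same path.

```python
import os
import tempfile


def walk_types(top):
    """Return the set of type names os.walk() reports for file names under top."""
    return {type(name).__name__
            for _folder, _subfolders, filenames in os.walk(top)
            for name in filenames}


with tempfile.TemporaryDirectory() as top:
    # Hypothetical sample file with a non-ASCII name.
    with open(os.path.join(top, "Ha\u0161ek.txt"), "w") as handle:
        handle.write("sample")
    text_result = walk_types(top)                # text path in, text names out
    bytes_result = walk_types(os.fsencode(top))  # bytes path in, bytes names out

print(text_result, bytes_result)
```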

Answer 2

hip*_*ker 6

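Python 2 uses ASCII as its default encoding, which is annoying. If you want to change it permanently, find and edit your site.py file, search for def setencoding(), and a few lines below it change encoding = "ascii" to encoding = "utf-8". Bye bye, default ASCII encoding.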
