String handling error: UnicodeDecodeError: 'utf8' codec can't decode

Question

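I'm trying to analyse the frequency of a series of passwords. My script works with other input files, but my current data set seems to contain some bad characters. How can I get past the "bad" data?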

import re
import collections
words = re.findall(r'\w+', open('rockyou.txt').read().lower())
a = collections.Counter(words).most_common(50)
for word in a:
    print(word)

Then I get this error:

Traceback (most recent call last):
  File "shakecount.py", line 3, in <module>
    words = re.findall('\w+', open('rockyou.txt').read().lower().ASCII)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 5079963: invalid continuation byte

Any ideas?

Answer 1

agf*_*agf 5

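Your code doesn't quite match your error (I assume you were trying to debug it?), but the underlying problem is that your text file isn't UTF-8. Byte 0xf1 followed by a byte that is not a valid continuation byte cannot occur in well-formed UTF-8, so the decoder gives up.

You need to specify the encoding manually. My best guess is latin-1, which maps every possible byte to a character and therefore never raises a decode error: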

words = re.findall(r'\w+', open('rockyou.txt', encoding='latin-1').read().lower())

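If you'd rather keep decoding as UTF-8 and carry on past the bad bytes, you can pass errors='ignore' or errors='replace' to open. 'ignore' silently drops the undecodable bytes, while 'replace' substitutes the U+FFFD replacement character for each one.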
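
As a rough sketch of how those options fit together, the counting logic could be wrapped in a small helper. The name count_words and its parameters are made up for illustration and are not part of the original script. It reads the file line by line so a large password list does not have to fit in memory at once:

```python
import collections
import re


def count_words(path, encoding="utf-8", errors="replace", top=50):
    """Return the `top` most common word tokens in the file at `path`.

    Hypothetical helper: the name and defaults are assumptions,
    not part of the original script.
    """
    counts = collections.Counter()
    # errors="replace" turns each undecodable byte into U+FFFD;
    # errors="ignore" would drop those bytes instead.
    with open(path, encoding=encoding, errors=errors) as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))
    return counts.most_common(top)
```

Calling count_words('rockyou.txt', encoding='latin-1') would follow the latin-1 guess above, while count_words('rockyou.txt') keeps UTF-8 and replaces the bad bytes. Note that with 'replace' a word containing a bad byte is split in two, because U+FFFD does not match \w.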
