Remove every non-UTF-8 symbol from a string

Question

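I have a large number of files and a parser. What I need to do is strip all non-UTF-8 symbols and put the data into MongoDB. Currently I have code like this: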

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't understand. Is there a simple way to do this?

UPD: It seems that Python and Mongo disagree on the definition of a valid UTF-8 string.

Answer 1

Irs*_*hat 61

Try the line below in place of the last two lines. Hope it helps:

line=line.decode('utf-8','ignore').encode("utf-8")

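@ChediBechikh here is how you do it in python3: `bytes(line, 'utf-8').decode('utf-8', 'ignore')` (10 upvotes)
Python 3.5 no longer has decode or encode functions (9 upvotes)
`line.decode('utf-8','ignore').encode("utf-8")` produces this error: _AttributeError: 'str' object has no attribute 'decode'_. I am using python3 (2 upvotes)
This does not seem to work. I get a lot of special characters: `\00\00\00\00\00` (2 upvotes)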

Answer 2

Ale*_*exG 18

For Python 3, as mentioned in a comment in this thread, you can do the following:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

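The 'ignore' argument prevents an error from being raised if any characters cannot be decoded.

If your line is already a bytes object (e.g. b'my string'), then you only need to decode it with decode('utf-8', 'ignore').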
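
As a minimal sketch of the bytes case in Python 3, assuming the file is opened in binary mode so that each line arrives as `bytes` (the helper names `clean_bytes` and `clean_lines` are hypothetical, not from this thread):

```python
def clean_bytes(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping any bytes that are not valid UTF-8."""
    return raw.decode("utf-8", "ignore")


def clean_lines(path):
    """Yield each line of the file at `path` as a cleaned str."""
    # Binary mode: Python hands us the raw bytes and does not decode them itself.
    with open(path, "rb") as fp:
        for raw in fp:
            yield clean_bytes(raw.strip())
```

For example, `clean_bytes(b"abc\xffdef")` drops the stray `\xff` byte, while a valid multi-byte sequence such as `b"\xe2\x86\x90"` is kept and decoded normally.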

But if `line` is already a `str` in py3, can it even contain non-UTF-8 data? (4 upvotes)
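
On the question in the comment above: a Python 3 `str` can hold lone surrogate code points (for instance after decoding with `errors='surrogateescape'`). These cannot be encoded as UTF-8, so `bytes(line, 'utf-8')` would raise `UnicodeEncodeError` for such a string. A small sketch, using the hypothetical helper name `drop_unencodable`, that applies `'ignore'` at the encode step instead:

```python
def drop_unencodable(text: str) -> str:
    """Return `text` without code points that cannot be encoded as UTF-8."""
    # Lone surrogates are skipped while encoding; everything else round-trips unchanged.
    return text.encode("utf-8", "ignore").decode("utf-8")
```

Whether this matters depends on how the strings were produced; text read with a plain `open(fname, encoding='utf-8')` would normally not contain surrogates.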

Answer 3

HMS*_*HMS 6

An example of handling non-UTF-8 characters:

import string

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"

# Keep only characters found in string.printable (ASCII printable characters and whitespace).
print(''.join(x for x in test if x in string.printable))

This removes all non-ASCII characters, which includes many valid UTF-8 characters (9 upvotes)
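
To illustrate the point in the comment above, here is a hedged sketch of an alternative for Python 3 that removes only characters in the Unicode "Other" categories (control, format, surrogate, private use, unassigned) and keeps the rest, including non-ASCII text. It relies on `unicodedata.category`; the helper name `strip_control_chars` and the choice to keep newline and tab are assumptions, not something proposed in this thread:

```python
import unicodedata


def strip_control_chars(text, keep="\n\t"):
    """Drop characters whose Unicode category starts with 'C', except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )
```

With the sample string from this answer, this would remove `\x03` but leave `\xa3` (the pound sign) and `\xa0` (the non-breaking space) in place, whereas the `string.printable` filter drops all three.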
