How do I remove all non-UTF-8 characters from a string in Python 3?

Question

Dav*_*ave 5 python encode decode utf-8 python-3.x

I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using "lambda x: x.decode('utf-8','ignore').encode("utf-8")", as shown below:

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

But this results in an error...

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a general way to remove all non-UTF-8 characters from a string, that is what I'm looking for.

Answer 1

Sha*_*ger 7

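You're starting with a string. You can't decode a str (it is already decoded text; you can only encode it back to binary data). UTF-8 can encode almost all valid Unicode text (which is what str stores), so this shouldn't come up often, but if you're running into surrogate characters in your input, you can reverse the direction, changing: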

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

This encodes everything that can be encoded as UTF-8, discards whatever cannot, and then decodes the now-clean UTF-8 bytes.
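
A minimal sketch of that encode-then-decode round trip, assuming the offending characters are lone surrogates. The helper name strip_unencodable and the sample values are made up for illustration and are not from the original post. The last lines cover the separate case where the data is still raw bytes, in which the direction tried in the question does apply.

```python
def strip_unencodable(text: str) -> str:
    """Drop code points that cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    # Encode with errors="ignore" to discard unencodable code points,
    # then decode the resulting clean UTF-8 bytes back to str.
    return text.encode("utf-8", "ignore").decode("utf-8")


# Hypothetical sample: U+DC80 is a lone surrogate, which UTF-8 cannot encode.
dirty = "caf\u00e9 \udc80co-op"
clean = strip_unencodable(dirty)
print(clean)

# If the input is still bytes (not str), decoding with errors="ignore"
# drops the invalid byte sequences instead.
raw = b"caf\xc3\xa9\xff co-op"  # hypothetical bytes with a stray 0xFF
from_bytes = raw.decode("utf-8", "ignore")
print(from_bytes)
```

In the question's code, the same change would mean using x.encode('utf-8','ignore').decode("utf-8") inside the lambda, since the values of d['type'] are already str objects.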