Python: convert a binary file to a string while ignoring non-ASCII characters

Question

I have a binary file from which I want to extract all ASCII characters while ignoring the non-ASCII ones. This is what I have so far:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close()

However, writing to the file fails with UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How can I make Python ignore non-ASCII characters?

Answer 1

bgp*_*ter 5

Use the built-in ASCII codec and tell it to ignore any errors, for example:

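with open(filename, 'rb') as fobj:
    # Decode the raw bytes (UTF-16, little-endian) into a unicode string
    text = fobj.read().decode('utf-16-le')

# encode() returns a byte string, so write it in binary mode;
# 'ignore' drops every character that ASCII cannot represent
with open("text.txt", "wb") as out:
    out.write(text.encode('ascii', 'ignore'))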
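
The u'...' output shown in this answer comes from Python 2. On Python 3 the same idea could be sketched roughly as follows. The function name extract_ascii, its parameters, and the temporary file names are made up for illustration, and the input is assumed to decode cleanly as UTF-16-LE, as in the question:

```python
import os
import tempfile


def extract_ascii(src_path, dst_path, encoding="utf-16-le"):
    """Write only the ASCII characters of an encoded file to dst_path."""
    with open(src_path, "rb") as src:
        # Decode the raw bytes into a str (strict, as in the question)
        text = src.read().decode(encoding)
    # 'ignore' drops every non-ASCII character; decode turns the bytes back into str
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    # newline="" writes line endings exactly as they appear in the text
    with open(dst_path, "w", encoding="ascii", newline="") as dst:
        dst.write(ascii_text)
    return ascii_text


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "input.bin")  # hypothetical sample file
    dst = os.path.join(workdir, "text.txt")
    with open(src, "wb") as f:
        f.write("hello \u00a0 there".encode("utf-16-le"))
    print(repr(extract_ascii(src, dst)))
```

Opening the output file with an explicit encoding avoids depending on the platform default, which is what triggers the implicit ASCII encode in the question.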

You can test and experiment with this in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Simply trying to convert it to a str raises an exception:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does trying to encode the unicode string as ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it cannot handle works fine:

>>> s.encode('ascii', 'ignore')
'hello  there'
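
'ignore' is only one of the standard error handlers that encode() accepts. As a rough comparison, here is a small Python 3 sketch (so the results are bytes objects and print with a b'' prefix); the variable names are illustrative only:

```python
s = "hello \u00a0 there"

# The default handler, 'strict', raises on the first unencodable character
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# The other handlers substitute or drop the character instead of raising
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = s.encode("ascii", handler)
    print(handler, results[handler])
```

'ignore' loses the character silently, while 'replace' leaves a '?' marker, which may be preferable if you need to see where something was removed.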

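@VeraWang - ASCII characters 0..31 are non-printable (including those two; see the chart on the Wikipedia page about ASCII: http://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart). If this doesn't do what you need, more information about the actual problem you are trying to solve would be useful... (2 upvotes)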
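
If the goal is to keep only printable ASCII, the control characters have to be filtered out separately, because encode('ascii', 'ignore') keeps them. A minimal Python 3 sketch, assuming "printable" means codes 32..126 and that tab, newline, and carriage return should survive as layout; the helper name printable_ascii and its keep parameter are hypothetical:

```python
def printable_ascii(text, keep=("\t", "\n", "\r")):
    """Return only characters with codes 32..126, plus any listed in `keep`."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126 or ch in keep)


# NUL, BEL, the no-break space, the accented letter, and DEL are all dropped
sample = "ok\x00\x07 line\n\u00a0caf\u00e9\x7f"
print(repr(printable_ascii(sample)))
```

Pass keep=() to strip the whitespace control characters as well.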
