How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?

Question

not*_*bit 2 python stdin pipe character-encoding python-3.x

I am reading some mainly HEX input into a Python3 script. However, the system is set to use UTF-8, and when piping from a Bash shell into the script, I keep getting the following UnicodeDecodeError:

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

Following other SO answers, I am using sys.stdin.read() in Python3 to read the piped input, like this:

import sys
...
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...

It works when piped this way:

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

However, using the raw format does not work:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"

    ???
   ??

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

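From what I understand so far, a UTF-8 lead byte is expected to be followed by 1-3 more bytes, like this:

UTF-8 is a variable-width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. Therefore, any leading byte (the first byte of a multi-byte UTF-8 character, in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes in the range 0x80 - 0xBF.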
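
As a minimal sketch of the rule quoted above, the byte string from the question can be decoded directly in Python. The names `RAW` and `is_continuation` are introduced here for illustration only. 0xED announces a multi-byte sequence, but the next byte, 0xFF, lies outside 0x80 - 0xBF, so decoding fails at position 0:

```python
# Bytes produced by: echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"
RAW = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"


def is_continuation(byte: int) -> bool:
    """Return True if byte is a UTF-8 continuation byte (0x80 - 0xBF)."""
    return 0x80 <= byte <= 0xBF


try:
    RAW.decode("utf-8")
    error = None
except UnicodeDecodeError as e:
    # Keep the exception so its position can be inspected below.
    error = e

# 0xED is a lead byte, but 0xFF is not a valid continuation byte.
print(hex(RAW[0]), is_continuation(RAW[1]))
if error is not None:
    print(error.start, error.reason)
```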

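However, I can't always be sure where my input stream comes from, and it may well be raw data rather than the ASCII HEX version above. So I need to handle this raw input somehow.

I have looked at a few alternatives, such as:

using codecs.decode
using open("myfile.jpg", "rb", buffering=0) with raw I/O
using bytes.decode(encoding="utf-8", errors="ignore") on bytes
or just using open(...)

But I don't know whether, or how, they can read a piped input stream the way I want.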
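
For reference, here is a small sketch of what the `errors` argument of `bytes.decode` does with the bytes from this question. It assumes the bytes have already been read in some binary-safe way; none of these handlers reads a pipe by itself. The variable names are illustrative only.

```python
RAW = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

# errors="ignore" silently drops undecodable bytes, so data is lost.
ignored = RAW.decode("utf-8", errors="ignore")

# errors="backslashreplace" keeps a readable \xNN escape for each bad byte.
escaped = RAW.decode("utf-8", errors="backslashreplace")

# errors="surrogateescape" maps each bad byte to a lone surrogate, which
# can be turned back into the original bytes when encoding again.
smuggled = RAW.decode("utf-8", errors="surrogateescape")
restored = smuggled.encode("utf-8", errors="surrogateescape")

print(repr(ignored))
print(repr(escaped))
print(restored == RAW)
```

Only the surrogateescape round trip preserves the original bytes, so the lossy handlers are a poor fit when the input may be genuinely binary.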

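How can I make my script also handle a raw byte stream?

PS. Yes, I have read loads of similar SO questions, but none of them deals adequately with this UTF-8 input error. The best one is this one.

This is not a duplicate.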

Answer 1

not*_*bit 6

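I finally managed to work around this by not using sys.stdin!

Instead I used with open(0, 'rb'), where:

0 is the file descriptor for stdin.
'rb' opens it for reading in binary mode.

This avoids the problem because nothing tries to decode the piped bytes using the locale's encoding. I got the idea after seeing that the following worked and returned the correct (unprintable) characters: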

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

???
   ??

So to correctly read any piped data, I used:

if not sys.stdin.isatty() :
    try:
        with open(0, 'rb') as f: 
            inpipe = f.read()

    except Exception as e:
        err_unknown(e)        
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...

This reads your piped data into a Python bytes object.
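
A possible alternative to reopening descriptor 0 is the `buffer` attribute of `sys.stdin`, which exposes the binary layer underneath the text wrapper. The sketch below is not the approach used above. The helper `read_raw` and its `stream` parameter are hypothetical names, and the parameter exists only so the function can be tried without a real pipe.

```python
import io
import sys


def read_raw(stream=None) -> bytes:
    """Read all bytes from the binary layer beneath a text stream.

    Defaults to sys.stdin; pass another text stream for testing.
    """
    if stream is None:
        stream = sys.stdin
    return stream.buffer.read()


# Simulate piped stdin: a UTF-8 text wrapper over undecodable bytes.
RAW = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
fake_stdin = io.TextIOWrapper(io.BytesIO(RAW), encoding="utf-8")

data = read_raw(fake_stdin)
print(data == RAW)
```

In a script, this would be called as `inpipe = read_raw()` after the `sys.stdin.isatty()` check.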

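The next problem is to determine whether the piped data came from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated with echo -ne "\xBA\xDD\xAT\xA0", as shown in the OP. In my case I just used a RegEx to look for any bytes that are not ASCII hex digits or spaces.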

if inpipe :
    rx = re.compile(b'[^0-9a-fA-F ]+') 
    r = rx.findall(inpipe.strip())
    if r == [] :
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Of course, this could be done better and smarter. (Comments welcome!)
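
One way this might be tightened, as a sketch only: require the whole input to consist of hex digits and whitespace, then let `bytes.fromhex` do the conversion and fall back to treating the input as binary if it fails. The names `HEX_TEXT` and `classify` are made up for this example.

```python
import re

# Only ASCII hex digits and whitespace, with at least one character.
HEX_TEXT = re.compile(rb"[0-9a-fA-F\s]+")


def classify(data: bytes):
    """Return ("hex", decoded_bytes) or ("binary", data)."""
    stripped = data.strip()
    if HEX_TEXT.fullmatch(stripped):
        try:
            # bytes.fromhex skips whitespace between byte pairs
            # (all ASCII whitespace on Python 3.7+, spaces before that).
            return "hex", bytes.fromhex(stripped.decode("ascii"))
        except ValueError:
            # For example an odd number of hex digits, such as b"abc".
            pass
    return "binary", data


print(classify(b"de ad be ef\n"))
print(classify(b"\xed\xff\xff\x0b")[0])
```

Note that this is still a heuristic: a binary stream that happens to contain only hex digit characters would be classified as a hex string.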

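Addendum: (from here)

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. In text mode, if encoding is not specified, the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes, use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, a synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having first been decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default); otherwise, an error will be raised.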
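
The closefd paragraph matters here because with open(0, 'rb') closes descriptor 0 when the block ends. A minimal sketch of the difference, using a throwaway pipe from `os.pipe()` as a stand-in for stdin so that it can be run anywhere (the variable names are illustrative):

```python
import os

# A throwaway pipe stands in for stdin so the sketch runs anywhere.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"\xed\xff\xff\x0b")
os.close(write_fd)

# closefd=False: leaving the with-block closes the file object only.
with open(read_fd, "rb", closefd=False) as f:
    data = f.read()

# The descriptor is still valid, so os.fstat does not raise OSError.
still_open = True
try:
    os.fstat(read_fd)
except OSError:
    still_open = False

os.close(read_fd)  # clean up the descriptor explicitly
print(data, still_open)
```

Applied to stdin, that would be open(0, 'rb', closefd=False), which leaves descriptor 0 usable after the read.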

Archived:	7 years, 3 months ago
Viewed:	3067 times
Last active:	5 years, 6 months ago