and*_*ndy 4 python unicode binary hex ascii
In a solution for determining whether a file is binary or text in Python, the answerer uses:
textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))
and then uses .translate(None, textchars) to remove (that is, replace with nothing) all such characters from a file read in as binary.
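The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what counts as text and what does not). What is the significance of these numbers for telling text files apart from binary files?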
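They represent the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 byte values for accented characters and other non-ASCII letters.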
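You would expect text to use only those code points. Binary data will use all code points in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).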
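To make the idea concrete, here is a minimal Python 3 sketch of how the byte set from the question can be turned into a check. The names is_binary_string and is_binary_file, and the 1024-byte block size, are assumptions made for illustration and are not taken from the question.

```python
# Same byte set as the snippet in the question.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data):
    """Return True if data (a bytes object) holds any byte outside textchars."""
    # translate(None, textchars) deletes every "text" byte;
    # whatever survives is a byte the heuristic treats as non-text.
    return bool(data.translate(None, textchars))


def is_binary_file(path, blocksize=1024):
    """Hypothetical helper: apply the check to the first block of a file."""
    with open(path, "rb") as f:
        return is_binary_string(f.read(blocksize))
```

Only the first block is inspected in this sketch, so a file whose non-text bytes appear later would still be reported as text.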
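It is a heuristic at best, however. Some text files may still use C0 control codes other than the 7 explicitly named, and I am sure binary data exists that happens not to include any of the 25 byte values missing from the textchars string.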
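The count of 25 and both failure modes can be checked directly. The sketch below is only an illustration; the sample byte strings are made up for the purpose.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Byte values that are NOT in textchars: all of them are C0 control codes.
excluded = sorted(set(range(0x100)) - set(textchars))
print(len(excluded))  # 25
print(max(excluded) < 0x20)  # True

# Text that is flagged as binary: it contains a vertical tab (0x0B).
text_with_vt = b"column one\x0bcolumn two\n"
print(bool(text_with_vt.translate(None, textchars)))  # True

# Arbitrary high bytes that pass as "text": nothing is left after deletion.
high_bytes = bytes(range(0x80, 0x100))
print(bool(high_bytes.translate(None, textchars)))  # False
```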
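The author of that range probably based it on the text_chars table from the file command. That table marks bytes as non-text, ASCII, Latin-1, or non-ISO extended ASCII, and includes documentation on why those code points were chosen: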
/*
* This table reflects a particular philosophy about what constitutes
* "text," and there is room for disagreement about it.
*
* [....]
*
* The table below considers a file to be ASCII if all of its characters
* are either ASCII printing characters (again, according to the X3.4
* standard, not isascii()) or any of the following controls: bell,
* backspace, tab, line feed, form feed, carriage return, esc, nextline.
*
* I include bell because some programs (particularly shell scripts)
* use it literally, even though it is rare in normal text. I exclude
* vertical tab because it never seems to be used in real text. I also
* include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
* because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
* character to. It might be more appropriate to include it in the 8859
* set instead of the ASCII set, but it's got to be included in *something*
* we recognize or EBCDIC files aren't going to be considered textual.
*
* [.....]
*/
Interestingly enough, that table excludes 0x7F, which the code you found does not.
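If you wanted the Python check to treat 0x7F (DEL) as non-text as well, one possible adjustment is sketched below. The name textchars_no_del is hypothetical, and this is an assumption about how one might mirror the table, not something the linked answer does.

```python
# Byte set from the question: includes 0x7F (DEL) via range(0x20, 0x100).
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Variant that drops DEL, so it is treated like the other control codes.
textchars_no_del = bytearray(b for b in textchars if b != 0x7F)

sample = b"abc\x7f"
print(sample.translate(None, textchars))         # b''
print(sample.translate(None, textchars_no_del))  # b'\x7f'
```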