Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

Question

and*_*ndy 4 python unicode binary hex ascii

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

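An answer I found on telling text files from binary files in Python defines the byte set above and then calls .translate(None, textchars) to delete every such byte from the contents of a file read in binary mode.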
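
For reference, the check described above can be sketched as follows in Python 3. The function name is_binary_bytes is made up for this illustration and is not taken from the answer being discussed; only textchars and the .translate(None, textchars) call come from it.

```python
# Byte values treated as "text": seven C0 control codes plus 0x20..0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_bytes(data: bytes) -> bool:
    """Return True if data contains any byte that is not in textchars.

    Hypothetical helper name, used only for this sketch.
    """
    # translate(None, delete) keeps the bytes unchanged except that every
    # byte listed in `delete` is removed. Whatever is left is "non-text".
    leftover = data.translate(None, textchars)
    return bool(leftover)


print(is_binary_bytes(b"hello\tworld\r\n"))  # only printable bytes, tab, CR, LF
print(is_binary_bytes(b"\x00\x01\x02"))      # NUL and other C0 controls
```

In practice one would probably pass only the first block of a file (for example the result of open(path, "rb").read(1024)) rather than the whole contents; that block size is an assumption of this sketch.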

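The answerer also says that this choice of numbers is "based on file(1) behaviour" (for what counts as text and what does not). What is the significance of these numbers for telling text files from binary files?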

Answer 1

Mar*_*ers 6

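They represent the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 byte values for accented characters, etc.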

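You would expect text to use only those code points. Binary data would use all code points in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).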

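It is a heuristic at best, however. Some text files could still use C0 control codes other than the 7 explicitly named, and I am sure binary data exists that happens not to contain any of the 25 byte values missing from the textchars string.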
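
Both failure modes are easy to reproduce. The snippet below is only an illustration: the helper name looks_binary and the two sample inputs are invented for this sketch. It first counts the byte values missing from textchars, then shows one text input that gets flagged as binary and one non-text input that passes as text.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Every byte value the heuristic treats as non-text.
non_text = [b for b in range(0x100) if b not in textchars]
print(len(non_text))  # 25: the 32 C0 controls (0x00-0x1F) minus the 7 listed


def looks_binary(data: bytes) -> bool:
    """Hypothetical helper: True if any byte falls outside textchars."""
    return bool(data.translate(None, textchars))


# Text that is flagged as binary: UTF-16 encodes ASCII letters with NUL bytes.
utf16_text = "hello".encode("utf-16-le")
print(looks_binary(utf16_text))

# Arbitrary data that passes as text: it only uses byte values 0x80-0xFF.
high_bytes = bytes(range(0x80, 0x100))
print(looks_binary(high_bytes))
```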

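The author of that range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those code points were picked: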

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly enough, that table excludes 0x7F, which the code you found does not.
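
The difference is quick to check. In the sketch below, textchars_no_del is a hypothetical variant invented for this illustration; it only drops DEL (0x7F) and is not a reproduction of the table in file.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# range(0x20, 0x100) runs straight through 0x7F, so DEL counts as text here.
print(0x7F in textchars)

# Hypothetical variant that leaves DEL out by splitting the range around it.
textchars_no_del = (
    bytearray([7, 8, 9, 10, 12, 13, 27])
    + bytearray(range(0x20, 0x7F))   # printable ASCII, 0x20..0x7E
    + bytearray(range(0x80, 0x100))  # high half, unchanged
)
print(0x7F in textchars_no_del)
```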
