The file command apparently returns the wrong MIME type

Jas*_*ett 8 file-command mime-types

Why doesn't the following return text/csv?

$ echo 'foo,bar\nbaz,quux' > temp.csv;file -b --mime temp.csv
text/plain; charset=us-ascii

I'm using this example for clarity, but I have the same problem with other CSV files.

$ file -b --mime '/Users/jasonswett/projects/client_work/gd/spec/test_files/wtf.csv'
text/plain; charset=us-ascii

Why doesn't it think a CSV is a CSV? Is there anything I can do to the CSV to make file return the "right" thing?

小智 7

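MIME types are determined by what the Unix man pages call "magic numbers". Many file formats, binary ones in particular, carry a magic number near the start of the file that identifies the file type and format. The excerpt below is from the file command's man page: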

The magic number tests are used to check for files with data in partic-
       ular fixed formats.  The canonical example of this  is  a  binary  exe-
       cutable  (compiled  program)  a.out  file,  whose  format is defined in
       a.out.h and possibly exec.h in the standard include  directory.   These
       files  have  a  'magic  number'  stored  in a particular place near the
       beginning of the file that tells the UNIX  operating  system  that  the
       file  is  a binary executable, and which of several types thereof.  The
       concept of 'magic number' has been applied by extension to data  files.
       Any  file  with  some invariant identifier at a small fixed offset into
       the file can usually be described in this way.  The information identi-
       fying   these   files   is   read   from   the   compiled   magic  file
       /usr/share/file/magic.mgc , or  /usr/share/file/magic  if  the  compile
       file  does  not exist. In addition file will look in $HOME/.magic.mgc ,
       or $HOME/.magic for magic entries.
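
To make the magic-number idea concrete, here is a minimal sketch that compares a file's leading bytes against a few well-known signatures. The function names `first_bytes_hex` and `guess_by_magic` and the three-entry table are illustrative assumptions, not how file itself is implemented; the real magic database is far larger and supports offsets and more complex tests.

```shell
#!/bin/sh
# Print the first N bytes of a file as lowercase hex with no separators.
# $1 = path, $2 = number of bytes to read (hypothetical helper)
first_bytes_hex() {
    head -c "$2" "$1" | od -An -tx1 | tr -d ' \n'
}

# Guess a MIME type from a tiny, hand-written signature table.
# This only illustrates the principle behind magic-number tests.
guess_by_magic() {
    case "$(first_bytes_hex "$1" 8)" in
        89504e470d0a1a0a*) echo "image/png" ;;        # PNG signature
        25504446*)         echo "application/pdf" ;;  # "%PDF"
        1f8b*)             echo "application/gzip" ;; # gzip header
        *)                 echo "unknown" ;;
    esac
}
```

A CSV file starts with whatever its first field happens to be, so nothing in a table like this can match it, and a check of this kind falls through to `unknown`.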

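The man page also says that if a file does not match any magic entry, it is examined to see whether it looks like a text file, and the best-fitting character set (ASCII, ISO-8859-x, non-ISO 8-bit extended ASCII, and so on) is reported: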

 If a file does not match any of the entries in the magic    file,  it  is
       examined to see if it seems to be a text file.  ASCII, ISO-8859-x, non-
       ISO 8-bit extended-ASCII character sets (such as those used  on  Macin-
       tosh  and  IBM  PC systems), UTF-8-encoded Unicode, UTF-16-encoded Uni-
       code, and EBCDIC character sets can be distinguished by  the  different
       ranges  and  sequences  of bytes that constitute printable text in each
       set.  If a file passes  any  of  these  tests,  its  character  set  is
       reported.  ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are iden-
       tified as ''text'' because they will be mostly readable on  nearly  any
       terminal
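
The text fallback described in the excerpt can be approximated as follows. This is a simplified sketch covering only the ASCII case; `is_ascii_text` is a hypothetical helper, and file's real tests also handle ISO-8859-x, UTF-8, UTF-16, and EBCDIC.

```shell
#!/bin/sh
# Succeed if every byte in the file is printable ASCII or common whitespace.
# $1 = path (hypothetical helper, ASCII case only)
is_ascii_text() {
    # Delete tab, LF, CR and the printable range 0x20-0x7e;
    # any byte left over is a control character or non-ASCII.
    leftover=$(LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' ')
    [ "$leftover" -eq 0 ]
}
```

For example, `is_ascii_text temp.csv && echo "text/plain; charset=us-ascii"` mirrors the result in the question: the CSV passes the ASCII test, and nothing more specific has claimed it first.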

Suggestion

Use the mimetype command instead of the file command:

mimetype temp.csv

Link for further reading

http://unixhelp.ed.ac.uk/CGI/man-cgi?file


小智 6

Unfortunately, you probably can't make file produce the correct output.

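The file command tests the first few bytes of a file against a database of magic numbers. That check is easy for binary files (such as images or executables), which have specific identifiers at the start of the file.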

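If the file is not binary, file checks the encoding and looks for certain specific words in the file to determine the type, but only for a limited number of file types (mostly programming languages).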
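
Since content alone gives file nothing to go on for CSV, one possible workaround is to refine its answer with the file name. The sketch below is an assumption layered on top of file, not a feature of it: `refine_mime` is a hypothetical helper, and the extension-to-type mapping is a choice made for illustration.

```shell
#!/bin/sh
# Refine a generic text/plain result using the file extension.
# $1 = MIME type reported by file, $2 = file name (hypothetical helper)
refine_mime() {
    case "$1:$2" in
        text/plain:*.csv) echo "text/csv" ;;
        text/plain:*.tsv) echo "text/tab-separated-values" ;;
        *)                echo "$1" ;;  # keep whatever file reported
    esac
}
```

It could be used as `refine_mime "$(file -b --mime-type temp.csv)" temp.csv`. The match is case-sensitive and trusts the extension, so it is only as reliable as the file names it is given.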