Processing a file that starts with a BOM (FF FE)

Question

dot*_*hen 12 character-encoding text-processing unicode

I received a .csv file that starts with an FF FE BOM:

$ head -n1 dotan.csv | hd
00000000  ff fe 41 00 64 00 20 00  67 00 72 00 6f 00 75 00  |..A.d. .g.r.o.u.|

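When I parse it with awk I get a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order of this file (using the CLI) so that ordinary tools can work with it?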

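Note that I believe the file contains only ASCII characters (apart from the BOM), but I cannot confirm that with grep, because grep treats it as a binary file: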

$ grep -P '^[\x00-\x7f]' dotan.csv 
Binary file dotan.csv matches

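Searching for the same pattern in VIM shows that every character matches!

Using iconv to convert to ASCII did not get rid of the \x00 values. In fact it made the problem worse, because now they look like null bytes rather than UTF-8!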

$ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt 
iconv: illegal input sequence at position 0

$ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt

$ head -n1 fixed.txt | hd
00000000  41 00 64 00 20 00 67 00  72 00 6f 00 75 00 70 00  |A.d. .g.r.o.u.p.|

How can I swap the byte order of this file (using the CLI) so that ordinary tools can work with it?

Answer 1

cuo*_*glm 18

According to this Wikipedia article, FF FE means UTF-16LE. So you should tell iconv to convert from UTF-16LE to UTF-8:

iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt
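
One detail worth checking with the command above is what happens to the BOM itself. The following is a minimal sketch, assuming GNU iconv and a POSIX shell; the sample file and the `Clicks` column name are invented for illustration. It compares `-f UTF-16` (byte order detected from the BOM) with `-f UTF-16LE` (byte order fixed by the caller) and then runs awk on the result.

```shell
tmp=$(mktemp -d)

# Recreate a file like the one in the question: FF FE BOM followed by UTF-16LE text.
# (The header row is hypothetical; only "Ad group" is visible in the question's hex dump.)
{ printf '\377\376'; printf 'Ad group,Clicks\n' | iconv -f UTF-8 -t UTF-16LE; } > "$tmp/sample.csv"

# -f UTF-16: iconv uses the BOM to detect the byte order and drops it from the output.
iconv -f UTF-16 -t UTF-8 "$tmp/sample.csv" > "$tmp/auto.txt"

# -f UTF-16LE: the byte order is given explicitly, so the BOM is converted like any
# other character and ends up as EF BB BF at the start of the UTF-8 output.
iconv -f UTF-16LE -t UTF-8 "$tmp/sample.csv" > "$tmp/le.txt"

# Compare the first three bytes of each result.
auto_head=$(head -c 3 "$tmp/auto.txt" | od -An -tx1 | tr -d ' \n')
le_head=$(head -c 3 "$tmp/le.txt" | od -An -tx1 | tr -d ' \n')
echo "UTF-16:   $auto_head"
echo "UTF-16LE: $le_head"

# awk now sees ordinary comma-separated fields with no NUL bytes in between.
first_field=$(awk -F, 'NR == 1 { print $1 }' "$tmp/auto.txt")
echo "$first_field"
```

If the converted file still begins with EF BB BF, the first field name will carry those three invisible bytes, which can break exact string comparisons in awk. Using `-f UTF-16` on a file that is known to start with a BOM avoids that.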

Answer 2

nis*_*ama 5

dos2unix also removes the BOM and converts UTF-16 to UTF-8:

$ printf %s あ|recode ..utf16 >a;xxd -p a;dos2unix a;xxd -p a
feff3042
dos2unix: converting file a to Unix format...
e38182

dos2unix also removes a UTF-8 BOM:

$ printf %b '\xef\xbb\xbfa'>a;dos2unix a;xxd -p a
dos2unix: converting file a to Unix format...
61
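
Where dos2unix is not installed, a UTF-8 BOM can also be dropped with standard tools. This is only a sketch, assuming a POSIX shell with `head -c` available; the `.nobom` file name is made up for the example. It checks the first three bytes and skips them only when they are EF BB BF.

```shell
tmp=$(mktemp -d)

# Same input as in the dos2unix example: a UTF-8 BOM (EF BB BF) followed by "a".
printf '\357\273\277a' > "$tmp/a"

# Read the first three bytes as a hex string.
first3=$(head -c 3 "$tmp/a" | od -An -tx1 | tr -d ' \n')

if [ "$first3" = "efbbbf" ]; then
    # tail -c +4 starts output at the fourth byte, which skips the BOM.
    tail -c +4 "$tmp/a" > "$tmp/a.nobom"
else
    # No BOM present: keep the file unchanged.
    cp "$tmp/a" "$tmp/a.nobom"
fi

# Hex dump of the result; only the byte for "a" should remain.
result=$(od -An -tx1 "$tmp/a.nobom" | tr -d ' \n')
echo "$result"
```

Unlike dos2unix, this does not touch line endings or convert UTF-16; it only removes the three BOM bytes.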

Asked:	11 years, 5 months ago
Viewed:	12273 times
Last active:	6 years ago