如何在文本文件中检测无效的utf8 unicode/binary

use*_*196 43 linux bash utf-8 character-encoding

我需要检测损坏的文本文件,其中存在无效(非ASCII)utf-8,Unicode或二进制字符.

�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
Run Code Online (Sandbox Code Playgroud)

我试过的:

iconv -f utf-8 -t utf-8 -c file.csv 
Run Code Online (Sandbox Code Playgroud)

这将文件从utf-8编码转换为utf-8编码,-c用于跳过无效的utf-8字符.然而最后这些非法字符仍然被打印出来.在linux或其他语言的bash中还有其他解决方案吗?

Bla*_*laf 51

假设您的语言环境设置为UTF-8,这可以很好地识别无效的UTF-8序列:

grep -axv '.*' file.txt
Run Code Online (Sandbox Code Playgroud)

  • 这段代码对我有帮助.关于使用选项的一些简短的说明会更直接地帮助我.这里使用的选项:`-a`将文件视为文本,必要时防止`grep`在找到无效的字节序列(不是utf8)时中止,`-v`反转输出显示不匹配的行,最后`-x'.*''表示匹配由任何utf8字符组成的完整行.因此会有输出,这是包含无效的utf8字节序列的行(包括反转的`-v`). (22认同)
  • 如何判断我的语言环境是否设置为 UTF-8? (2认同)
  • 检查`locale`的输出。 (2认同)
  • `grep` 使用 `locale` 输出的哪一部分来确定编码?您能否建议像“CORRECT_VAR_NAME=en_US.UTF-8 grep -axv '.*' file.txt”这样的命令? (2认同)

fed*_*qui 14

我想要grep非ASCII字符.

使用带有pcre的GNU grep(因为-P,总是不可用.在FreeBSD上你可以在包pcre2中使用pcregrep)你可以这样做:

grep -P "[\x80-\xFF]" file
Run Code Online (Sandbox Code Playgroud)

如何grep中的参考UNIX中的所有非ASCII字符.所以,实际上,如果你只想检查文件是否包含非ASCII字符,你可以说:

if grep -qP "[\x80-\xFF]" file ; then echo "file contains ascii"; fi
#        ^
#        silent grep
Run Code Online (Sandbox Code Playgroud)

要删除这些字符,您可以使用:

sed -i.bak 's/[\d128-\d255]//g' file
Run Code Online (Sandbox Code Playgroud)

这将创建一个file.bak文件作为备份,而原始文件file将删除其非ASCII字符.参考从csv删除非ascii字符.

  • 冒着重申已经很明显的风险,有数百万个*有效* UTF-8 序列不是 ASCII。删除这些显然是错误的解决方案。 (2认同)

tri*_*eee 5

根据定义,您正在查看的内容已损坏。显然,您正在显示以Latin-1呈现的文件。三个字符�代表三个字节值0xEF 0xBF 0xBD。但是这些是Unicode REPLACEMENT CHARACTER U + FFFD的UTF-8编码,它是尝试将字节从未知或未定义的编码转换为UTF-8的结果,可以正确显示为?(如果您使用的是本世纪的浏览器,则应该会看到带有问号的黑色菱形;但这还取决于您使用的字体等)。

因此,您关于“如何发现”这种特殊现象的问题很容易解决;Unicode代码点U + FFFD毫无用处,这是您所暗示的过程中唯一可能的症状。

从这是编码有效Unicode代码点的有效UTF-8序列的意义上说,它们不是“无效的Unicode”或“无效的UTF-8”;只是该特定代码点的语义是“这是无法正确表示的字符的替换字符”,即无效输入。

至于如何首先防止它,答案确实很简单,但也没有什么意义。您需要确定何时以及如何进行不正确的编码,并修复产生此无效输出的过程。

要仅删除U + FFFD字符,请尝试类似

perl -CSD -pe 's/\x{FFFD}//g' file
Run Code Online (Sandbox Code Playgroud)

但是同样,正确的解决方案是首先不要产生这些错误的输出。

(您没有透露示例数据的编码。它可能还会其他损坏。如果您向我们展示的是数据的UTF-8呈现的复制/粘贴,则该数据已被“双重编码” “。换句话说,有人将UTF-8文本(如上所述已损坏),并告诉计算机将其从Latin-1转换为UTF-8。这样做很容易;只需将其“返回”即可到Latin-1。那么您获得的应该是多余的错误转换之前的原始UTF-8数据。)


Bou*_*mas 5

尝试此操作,以便从外壳中查找非ASCII字符。

命令:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt
Run Code Online (Sandbox Code Playgroud)

输出:

2 Pour être ou ne pas être
4 By? ?i neby?
5 ???
Run Code Online (Sandbox Code Playgroud)

  • 非 ASCII 与非 utf8 不同;OP询问utf8。 (3认同)

ASC*_*NSI 3

此 Perl 程序应删除所有非 ASCII 字符:

 foreach $file (@ARGV) {
   open(IN, $file);
   open(OUT, "> super-temporary-utf8-replacement-file-which-should-never-be-used-EVER");
   while (<IN>) {
     s/[^[:ascii:]]//g;
     print OUT "$_";
   }
   rename "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER", $file;
}
Run Code Online (Sandbox Code Playgroud)

它的作用是将文件作为命令行上的输入,如下所示:
perl fixutf8.pl foo bar baz
然后,对于每一行,它都将非 ASCII 字符的每个实例替换为任何内容(删除)。
然后,它将修改后的行写入super-temporary-utf8-replacement-file-which-should-never-be-used-EVER(命名为不会修改任何其他文件)。
然后,它将临时文件重命名为原始文件的名称。

它接受所有 ASCII 字符(包括 DEL、NUL、CR 等),以防您对它们有特殊用途。如果您只需要可打印字符,只需替换:ascii::print:in s///

我希望这有帮助!如果这不是您要找的,请告诉我。