Asked by use*_*196 (43 votes) · Tags: linux, bash, utf-8, character-encoding
I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ … [long run of mojibake truncated; the rest of the sample is more of the same]
What I have tried:
iconv -f utf-8 -t utf-8 -c file.csv
This converts the file from UTF-8 encoding to UTF-8 encoding, with -c telling iconv to skip invalid UTF-8 characters. Yet afterwards those illegal characters are still printed. Is there another solution, in bash on Linux or in another language?
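One way to check what `iconv -c` actually drops is to redirect the output to a new file and compare byte counts. A minimal sketch (file names are hypothetical):

```shell
# Create a sample file containing one stray byte (0xFF) that is never
# valid in UTF-8, alongside some valid multibyte text.
printf 'caf\xc3\xa9 ok\nbad \xff byte\n' > sample.csv

# -c silently discards bytes that are not valid UTF-8; the cleaned
# copy is written to a new file instead of the terminal.
iconv -f utf-8 -t utf-8 -c sample.csv > sample.clean.csv

# Comparing sizes shows exactly one byte was dropped.
wc -c sample.csv sample.clean.csv
```

Note that iconv writes to stdout, so if the "illegal characters are still printed" it may simply be because the cleaned stream was viewed in the terminal rather than redirected to a file.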
Answer by Bla*_*laf (51 votes)
Assuming your locale is set to UTF-8, this works well to recognize invalid UTF-8 sequences:
grep -axv '.*' file.txt
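A minimal sketch of how this behaves (the C.UTF-8 locale is assumed to be available, as it is on most modern Linux systems):

```shell
# Two lines: one valid UTF-8, one containing an invalid sequence (0xC3 0x28).
printf 'valid line\nbroken \xc3\x28 line\n' > check.txt

# -a treats the file as text, -x matches whole lines, -v inverts:
# in a UTF-8 locale, '.' only matches a valid character, so any line
# that is not entirely valid UTF-8 fails to match '.*' and is printed.
LC_ALL=C.UTF-8 grep -axv '.*' check.txt
```

Only the corrupted line should be printed; a clean file produces no output and a non-zero exit status, which makes this usable in scripts.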
Answer by fed*_*qui (14 votes)
What I wanted was to grep for the non-ASCII characters.
Using GNU grep with PCRE (needed because of -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:
grep -P "[\x80-\xFF]" file
Reference: How do I grep for all non-ASCII characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can say:
if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
#       ^
#       silent grep
To delete these characters, you can use:
sed -i.bak 's/[\d128-\d255]//g' file
This creates a file.bak backup, while the original file has its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.
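The round trip can be sketched as follows (GNU sed is assumed, since \dNNN is a GNU extension; LC_ALL=C makes the byte-range class behave predictably):

```shell
printf 'abc\xc3\xa9def\n' > data.txt           # "abcédef" in UTF-8

# \dNNN is a GNU sed escape for the byte with that decimal value,
# so [\d128-\d255] matches every non-ASCII byte.
LC_ALL=C sed -i.bak 's/[\d128-\d255]//g' data.txt

cat data.txt        # both bytes of the "é" are gone
cat data.txt.bak    # the backup still holds the original line
```

Because this works byte by byte, it strips each byte of a multibyte character individually; that is fine when the goal is simply "ASCII only".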
What you are looking at is, by definition, corrupted. Apparently you are displaying the file as rendered in Latin-1; the three characters � represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see a black diamond with a question mark; but this also depends on the font you are using, etc.).
So your question about "how to detect" this particular phenomenon is easy to answer: the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom of the process you are implying.
These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence encoding a valid Unicode code point; it is just that the semantics of this particular code point are "this is a replacement character for a character which could not be represented properly", i.e. invalid input.
As for how to prevent it in the first place, the answer is really simple, but also rather uninformative: you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.
To just remove the U+FFFD characters, try something like
perl -CSD -pe 's/\x{FFFD}//g' file
But again, the correct solution is probably not to generate this faulty output in the first place.
(You are not revealing the encoding of your example data. It may have additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "doubly encoded". In other words, somebody took UTF-8 text, already corrupted as per the above, and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data from before the superfluous, incorrect conversion.)
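That "convert it back" step can be sketched with iconv (all file names are hypothetical):

```shell
# Simulate the damage: take UTF-8 text and wrongly re-encode it as
# if it were Latin-1, producing doubly encoded mojibake.
printf 'caf\xc3\xa9\n' > original.txt                 # "café" in UTF-8
iconv -f latin1 -t utf-8 original.txt > doubled.txt   # now reads as "cafÃ©"

# Undo the superfluous conversion by going back to Latin-1.
iconv -f utf-8 -t latin1 doubled.txt > restored.txt

cmp original.txt restored.txt && echo "round trip OK"
```

This works because Latin-1 maps every byte to a code point one-to-one, so the conversion is exactly reversible.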
Try this, to find non-ASCII characters from the shell.
Command:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
Output:
2 Pour être ou ne pas être
4 By? ?i neby?
5 ???
This Perl program should remove all non-ASCII characters:
foreach $file (@ARGV) {
    open(IN, '<', $file) or die "Cannot read $file: $!";
    open(OUT, '>', 'super-temporary-utf8-replacement-file-which-should-never-be-used-EVER')
        or die "Cannot write temporary file: $!";
    while (<IN>) {
        s/[^[:ascii:]]//g;   # delete every non-ASCII character
        print OUT $_;
    }
    close(IN);
    close(OUT);
    rename 'super-temporary-utf8-replacement-file-which-should-never-be-used-EVER', $file;
}
It takes the files given on the command line as input, like so:
perl fixutf8.pl foo bar baz
Then, for each line, it replaces every instance of a non-ASCII character with nothing (deletes it).
It then writes the modified line to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so as not to clobber any other file).
Finally, it renames the temporary file to the original file's name.
It accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have a special use for them. If you want only printable characters, simply replace [:ascii:] with [:print:] in the s///.
I hope this helps! Let me know if this isn't what you were looking for.
Viewed: 42,156 times