Grep does not match non-ASCII characters

Question

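I found a problematic sequence in a text file that is supposed to be UTF-8 encoded. Strangely, grep does not seem to match the line containing the non-ASCII byte.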

$ iconv -f utf8 -t iso88591 corrupt_part.txt --output corrupt_part.txt.conv
iconv: illegal input sequence at position 8
$ cat corrupt_part.txt
Oberallg�u
$ grep -P -n '[^\x00-\x7F]' corrupt_part.txt
$ od -h corrupt_part.txt
0000000 624f 7265 6c61 676c 75e4 0a20
0000014

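For example, \xe4 is ä in the extended ASCII set. Since the pattern excludes the control and printable characters (the ASCII range), the grep command above should match the \xe4 byte. Why do I get no grep output?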

Answer 1

Bar*_* IO 6

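e4 75 is indeed an illegal UTF-8 sequence. In UTF-8, a byte whose high nibble is 0xe introduces a three-byte sequence. The second byte of that sequence cannot be 0x75, because its high nibble (0x7) is not between 0x8 and 0xb.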
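
The rule above can be sketched in Python. This is a minimal illustration, not a full UTF-8 validator: the helper name `first_invalid_offset` is hypothetical, it checks only the lead and continuation bit patterns (overlong forms and surrogates are ignored), and the sample bytes are transcribed from the `od -h` dump in the question, assuming a little-endian host so that the word `75e4` stands for the byte pair `e4 75`.

```python
def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first structurally invalid UTF-8 byte, or -1.

    Simplified: only lead/continuation bit patterns are checked.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxxxxxx: single ASCII byte
            need = 0
        elif 0xC0 <= b <= 0xDF:   # 110xxxxx: lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:   # 1110xxxx: lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF7:   # 11110xxx: lead byte of a 4-byte sequence
            need = 3
        else:                     # stray continuation byte or invalid lead
            return i
        tail = data[i + 1 : i + 1 + need]
        # Every continuation byte must have a high nibble in 0x8..0xb.
        if len(tail) < need or any((t >> 4) not in (0x8, 0x9, 0xA, 0xB) for t in tail):
            return i
        i += 1 + need
    return -1


# Bytes from the question's dump (assumed little-endian word order).
sample = b"Oberallg\xe4u \n"
print(first_invalid_offset(sample))  # prints 8
```

The offset 8 agrees with the position reported by iconv in the question. Python's built-in decoder can be used the same way: `sample.decode("utf-8")` raises `UnicodeDecodeError`, whose `start` attribute holds the failing offset.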

This explains why iconv rejects the file as invalid UTF-8. Perhaps it is already ISO 8859-1?
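
One way to test the ISO 8859-1 guess is to decode the same bytes with that encoding and look at the result. The sketch below is only an illustration: the variable names are arbitrary and the bytes are again transcribed from the dump in the question. A successful decode does not prove the guess, because every byte value is valid in ISO 8859-1; what supports it is that the result reads as a plausible word.

```python
# Bytes transcribed from the od -h dump in the question.
raw = b"Oberallg\xe4u \n"

# ISO 8859-1 maps each byte to the code point with the same number,
# so this decode cannot fail; 0xe4 becomes U+00E4 (ä).
text = raw.decode("iso-8859-1")
print(text.strip())  # prints Oberallgäu

# Re-encoding as UTF-8 turns the single byte e4 into the pair c3 a4.
utf8 = text.encode("utf-8")
print(utf8)
```

On the command line this corresponds to converting in the opposite direction from the question, that is from ISO 8859-1 to UTF-8.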

For a summary of the UTF-8 encoding, see the encoding table in the Wikipedia article on UTF-8.

As for your grep problem, it may work if you specify the C/POSIX locale, in which characters are equivalent to bytes:

LC_ALL=C grep -P -n '[^\x00-\x7F]' corrupt_part.txt

On an old Ubuntu system with GNU grep, in an environment using the en_US.UTF-8 locale:

$ od -h bytes
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014
$ LC_ALL=C grep -P '[^\x00-\x7F]' bytes | od -h
0000000 624f 7265 6c61 676c 75e4 0a20
0000014

Archived:	9 years, 7 months ago
Views:	6673
Last activity:	9 years, 7 months ago