ASCII source file checker

Dou*_*ies 3 command-line documentation text-processing

For the official Ubuntu documentation, whose English source files are in DocBook XML, only ASCII characters are allowed. We use a "checker" command line (see here).

grep --color='auto' -P -n "[\x80-\xFF]" *.xml

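However, this command has a flaw: apparently not on all computers, it misses some lines that contain non-ASCII characters, which can lead to a false OK result.

Does anyone have a better suggestion for an ASCII checker command line?

Anyone interested can use this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 are missed by the check: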

$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?

Byt*_*der 5

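You can print all non-ASCII lines of a file with a Python 3 script that I host on GitHub:

GitHub: ByteCommander/encoding-check

You can either clone or download the whole repository, or simply save the file encoding-check and make it executable with chmod +x encoding-check.

Then you can run it like this, with the file to check as the only argument:

  • ./encoding-check FILENAME if it is located in your current working directory, or...
  • /path/to/encoding-check FILENAME if it is located in /path/to/, or...
  • encoding-check FILENAME if it is located in a directory that is part of the $PATH environment variable, for example /usr/local/bin or ~/bin.

Without any optional arguments, it prints each line in which it found non-ASCII characters, together with its line number. At the end, a summary line tells you how many lines the file has in total and how many of them contain non-ASCII characters.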

This method is guaranteed to decode all ASCII characters correctly and to detect everything that is definitely not ASCII.
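The idea of decoding strictly as ASCII can be sketched in a few lines of Python 3. This is not the encoding-check script itself, only a minimal illustration under the assumption that "non-ASCII" means "any byte that the ascii codec refuses to decode"; the function name non_ascii_lines and the sample data are hypothetical.

```python
def non_ascii_lines(raw_lines):
    """Return (line_number, raw_line) pairs for lines that are not pure ASCII.

    raw_lines is any iterable of bytes objects, for example a file
    opened in binary mode, so the result does not depend on the locale.
    """
    hits = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            # The ascii codec rejects every byte >= 0x80.
            raw.decode("ascii")
        except UnicodeDecodeError:
            hits.append((number, raw.rstrip(b"\r\n")))
    return hits


# Hypothetical sample: line 2 contains a UTF-8 no-break space (bytes C2 A0).
sample = [b"plain line\n", b"Appendix\xc2\xa0F\n", b"another plain line\n"]
for number, raw in non_ascii_lines(sample):
    print(f"{number}: {raw!r}")
```

To check a real file, pass a binary file object, such as open("install.en.txt", "rb"), instead of the sample list.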

Here is an example run on a file containing the first 20 lines of the given install.en.txt:

$ ./encoding-check install-first20.en.txt
     9: Appendix??F, GNU General Public License.
    14: (codename "???Xenial Xerus???"), for the 64-bit PC ("amd64") architecture. It also
    18: ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
--------------------------------------------------------------------------------
20 lines in 'install-first20.en.txt', thereof 3 lines with non-ASCII characters.

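The script also has some additional arguments to adjust the encoding to check and the output format. Have a look at the help and try them out: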

$ encoding-check -h
usage: encoding-check [-h] [-e ENCODING] [-s | -c | -l] [-m] [-w] [-n] [-f N]
                     [-t]
                     FILE [FILE ...]

Show all lines of a FILE containing characters that don't match the selected
ENCODING.

positional arguments:
  FILE                  the file to be examined

optional arguments:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        file encoding to test (default 'ascii')
  -s, --summary         only print the summary
  -c, --count           only print the detected line count
  -l, --lines           only print the detected lines
  -m, --only-matching   hide files without matching lines from output
  -w, --no-warnings     hide warnings from output
  -n, --no-numbers      do not show line numbers in output
  -f N, --fit-width N   trim lines to N characters, or terminal width if N=0;
                        non-printable characters like tabs will be removed
  -t, --title           print title line above each file

For --encoding, every codec that Python 3 knows is valid. Just try one; in the worst case you get a short error message...
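As a rough illustration of what "every codec Python 3 knows" means, the standard codecs module can tell whether a name is recognized, and a strict decode shows whether some bytes fit that encoding. This is a sketch, not code taken from encoding-check; the helper names codec_is_known and fails_to_decode are hypothetical.

```python
import codecs


def codec_is_known(name):
    """Return True if Python recognizes `name` as a codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def fails_to_decode(raw, encoding):
    """Return True if the bytes `raw` are not valid in `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


for name in ("ascii", "utf-8", "latin-1", "no-such-codec"):
    print(name, codec_is_known(name))

# A UTF-8 no-break space is invalid ASCII but valid UTF-8.
print(fails_to_decode(b"\xc2\xa0", "ascii"))
print(fails_to_decode(b"\xc2\xa0", "utf-8"))
```

Note that single-byte codecs such as latin-1 accept every byte value, so checking against them never reports a line.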


mur*_*uru 5

If you want to find non-ASCII characters, perhaps you should invert the search and exclude the ASCII characters instead:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:???????????????????????????????????????????????????????????????????????????????
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

Lines 9, 330, 337 and 359 contain Unicode no-break space characters.
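Invisible characters like these are easier to identify by name. The following Python 3 sketch lists every non-ASCII character of a line with its column, code point and Unicode name. The helper describe_non_ascii is hypothetical, and the sample line assumes that the no-break space in line 9 sits between "Appendix" and "F".

```python
import unicodedata


def describe_non_ascii(text):
    """Return (column, code_point, name) for each non-ASCII character."""
    found = []
    for column, char in enumerate(text, start=1):
        if ord(char) > 0x7F:
            # Fall back to a placeholder for characters without a name.
            name = unicodedata.name(char, "<unnamed>")
            found.append((column, f"U+{ord(char):04X}", name))
    return found


# Assumed content of line 9, with U+00A0 after "Appendix".
line = "Appendix\u00a0F, GNU General Public License."
for column, code_point, name in describe_non_ascii(line):
    print(column, code_point, name)
```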


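The particular output you get is probably due to grep's support for UTF-8. In a Unicode locale, some of these characters might compare equal to ordinary ASCII characters. In that case, forcing the C locale shows the expected result: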

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:???????????????????????????????????????????????????????????????????????????????
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
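One plausible way to picture the difference between the two runs is that the same range [\x80-\xFF] can be applied either to raw bytes or to decoded characters. The Python 3 sketch below is only an analogy using the re module, not a statement about grep's internals. On decoded text the range covers just the code points U+0080 to U+00FF, so a no-break space (U+00A0) matches but a curly quote (U+2018) does not, while on the UTF-8 bytes both match. The sample strings are assumptions modeled on lines 9 and 14.

```python
import re

# The same range, once over bytes and once over decoded characters.
BYTE_PATTERN = re.compile(rb"[\x80-\xFF]")
CHAR_PATTERN = re.compile(r"[\x80-\xFF]")

samples = {
    "no-break space": "Appendix\u00a0F",
    "curly quotes": "\u2018Xenial Xerus\u2019",
}

for label, text in samples.items():
    raw = text.encode("utf-8")
    matched_as_chars = bool(CHAR_PATTERN.search(text))
    matched_as_bytes = bool(BYTE_PATTERN.search(raw))
    print(f"{label}: chars={matched_as_chars} bytes={matched_as_bytes}")
```

Under this reading, a byte-oriented check (or the inverted character class '[^\x00-\x7F]') is the safer choice for an ASCII-only policy.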