Find the n most frequent words in a file using a stop word list on the command line

Question

Jas*_*sta 4 command-line text-processing

I want to find the most frequent words in a text file while applying a stop word list. I already have this code:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10 > test.txt

It comes from an old post, but my output file contains the following:

240 
 21 ipsum
 20 Lorem
 11 Textes
 9 Blindtexte
 7 Text
 5 F
 5 Blindtext
 4 Texte
 4 Buchstaben

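The first entry is just a blank. In the text these are punctuation marks (such as periods), but I don't want them counted, so what do I have to add?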

Answer 1

Joh*_*024 6

Consider this test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get the word counts:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

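How it works

grep -oE '[[:alpha:]]+' text.txt

This returns all the words, stripped of any whitespace or punctuation, one word per line.

sort

This sorts the words alphabetically.

uniq -c

This counts the number of times each word occurs. (For uniq to work, its input must be sorted.)

sort -nr

This sorts the output numerically in descending order, so that the most frequent words are at the top.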
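
The blank entry at the top of the question's output can be reproduced on a small input. The following is a minimal sketch, assuming a tr that supports the '[\n*]' repeat syntax and a grep that supports -o (GNU and BSD both do); the sample sentence is made up for illustration and is not from the original post.

```shell
# Hypothetical sample input, not taken from the original question.
sample='Lorem ipsum, dolor. Lorem!'

# tr -c turns every non-alphanumeric character into its own newline,
# so adjacent punctuation and spaces leave empty lines behind.
with_blanks=$(printf '%s\n' "$sample" |
  tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr)

# Adding -s squeezes each run of newlines into a single one.
squeezed=$(printf '%s\n' "$sample" |
  tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr)

# grep -o prints only the matched words, so empty lines never appear.
matched=$(printf '%s\n' "$sample" |
  grep -oE '[[:alnum:]]+' | sort | uniq -c | sort -nr)

printf '%s\n---\n%s\n---\n%s\n' "$with_blanks" "$squeezed" "$matched"
```

Note that -s alone is not a complete fix when the text begins with a non-word character: the leading run still collapses to one newline, which leaves a single empty line. The grep -o form does not have that edge case.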

Handling mixed case

Consider this mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want to count some and Some as the same word:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here, we added the -f option to sort so that it ignores case, and the -i option to uniq so that it ignores case as well.
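
As an alternative sketch (not part of the original answer), the words can be folded to lower case with tr before counting. Plain sort and uniq -c are then enough, and every word is reported in lower case regardless of which spelling came first. The temporary file simply recreates the mixed-case sample shown above.

```shell
# Recreate the mixed-case sample from above in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
This file has "many" words, some
with punctuation.  Some repeat,
many do not.
EOF

# Fold everything to lower case first; plain sort and uniq -c then suffice.
counts=$(grep -oE '[[:alpha:]]+' "$tmp" |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -nr)

printf '%s\n' "$counts"
rm -f "$tmp"
```

Lowercasing early also helps any later filtering step, since it only ever sees one spelling of each word.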

Excluding stop words

Suppose that we want to exclude these stop words from the count:

$ cat stopwords 
with
not
has
do

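So, we add grep -v to eliminate those words. Here -f stopwords reads the patterns from the file, -F treats them as fixed strings rather than regular expressions, -w matches whole words only, and -v inverts the match so that matching lines are dropped: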

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file

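
To get only the n most frequent words, as the question title asks, head can be appended to the pipeline. The following is a minimal sketch that combines the steps above; the function name top_words, its argument order, and the default of 10 are assumptions made for illustration, not something from the original answer. It also adds -i to the stop word filter, so that a stop word list written in a different case still matches.

```shell
# top_words FILE STOPWORDS [N]
# Hypothetical helper: prints the N most frequent words in FILE (default 10),
# ignoring case and skipping every word listed in STOPWORDS.
top_words() {
  grep -oE '[[:alpha:]]+' "$1" |
    tr '[:upper:]' '[:lower:]' |
    grep -viwFf "$2" |
    sort | uniq -c | sort -nr |
    head -n "${3:-10}"
}

# Demo on temporary copies of the sample files used above.
text_file=$(mktemp)
stop_file=$(mktemp)
printf '%s\n' 'This file has "many" words, some' \
  'with punctuation.  Some repeat,' 'many do not.' > "$text_file"
printf '%s\n' with not has do > "$stop_file"

top_words "$text_file" "$stop_file" 3
rm -f "$text_file" "$stop_file"
```

If the result is saved with a redirection, it is safer to write it to a different file than the input. In a pipeline of the form `... < test.txt | ... > test.txt`, the shell may truncate test.txt before the first command has read it.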

Archived:	9 years, 1 month ago
Views:	4300
Last recorded:	9 years ago