Get the number of occurrences of each word in a document

Question

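How can I find the count of each word in a file?

I want a histogram of every word in a text pipeline or document. The document will contain newlines and blank lines. I have already stripped out everything except [a-zA-Z].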

> cat doc.txt 
word second third 

word really
> cat doc.txt | ... # then count occurrences of each word \
                    # and print in descending order separated by delimiter
word 2
really 1
second 1
third 1


It needs to be reasonably efficient: the file is 1 GB of text, so an exponential-time approach is not workable.

Answer 1

pLu*_*umo 7

Try this:

grep -o '\w*' doc.txt | sort | uniq -c | sort -nr

Run Code Online (Sandbox Code Playgroud)

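-o prints each match on its own line instead of the whole matching line.
\w* matches a run of word characters (letters, digits and underscore).
sort sorts the matches before they are piped to uniq, which only merges adjacent identical lines.
uniq -c prints each unique line prefixed with its number of occurrences.
sort -nr sorts numerically by that count in reverse order, highest first.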

Output:

  2 word
  1 third
  1 second
  1 really
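
The question also mentions reading from a pipe and a 1 GB input. The following is a minimal sketch of the same pipeline applied to standard input; it assumes GNU grep and coreutils, and `count_words` is a made-up name for illustration. Setting `LC_ALL=C` is an assumption about the data (ASCII letters only, as the question states); it makes sort compare raw bytes, which is usually faster on large inputs.

```shell
#!/bin/sh
# count_words is a hypothetical helper; it reads text from standard input.
count_words() {
    # LC_ALL=C: byte-wise matching and comparison, reasonable for ASCII-only input.
    # grep -o emits one word per line, sort groups identical words,
    # uniq -c counts each group, and sort -nr puts the highest count first.
    LC_ALL=C grep -o '\w*' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr
}

# Sample document from the question, including the blank line.
result=$(printf 'word second third \n\nword really\n' | count_words)
printf '%s\n' "$result"
```

For real data the input would be piped in instead, for example `cat doc.txt | count_words`.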


Alternative:

Use awk to get the exact output format:

$ grep -o '\w*' doc.txt \
| awk '{seen[$0]++} END{for(s in seen){print s,seen[s]}}' \
| sort -k2,2nr

word 2
really 1
second 1
third 1
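
Because the question asks for efficiency on a 1 GB file, it may be worth letting awk do all the counting so that only the distinct words are ever sorted. The following is a minimal sketch under that assumption, not a benchmarked solution; `word_histogram` is a hypothetical name, and `tr` is swapped in for `grep -o` to split the input into one word per line.

```shell
#!/bin/sh
# word_histogram is a hypothetical helper; it reads text from standard input.
word_histogram() {
    # tr -cs replaces every run of non-letters with a single newline.
    # awk counts words in an associative array; NF skips any empty line.
    # sort -k2,2nr orders by count numerically (so 10 ranks above 2),
    # and -k1,1 breaks ties alphabetically by word.
    LC_ALL=C tr -cs 'A-Za-z' '\n' |
        LC_ALL=C awk 'NF { count[$0]++ } END { for (w in count) print w, count[w] }' |
        LC_ALL=C sort -k2,2nr -k1,1
}

# Sample document from the question, including the blank line.
result=$(printf 'word second third \n\nword really\n' | word_histogram)
printf '%s\n' "$result"
```

The separator between word and count is awk's output field separator, a space by default; a different delimiter could be chosen by setting OFS in the awk program, in which case sort would also need a matching -t option.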


Archived:	5 years, 3 months ago
Views:	1186
Last recorded:	5 years, 3 months ago