How do I create a frequency list of every word in a file?

Vil*_*age 33 bash file-io grep sed

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I want to generate a two-column list. The first column shows the words that occur, and the second column shows how often they occur, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 
  • To make this job simpler, I will remove all punctuation and change all of the text to lowercase letters before processing the list.
  • Unless there is a simple solution, words and word can be counted as two separate words.

So far I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this just shows "0" after each word.

How can I generate a list of every word that appears in the file, along with its frequency?

edu*_*ffy 60

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

  • Well, just amending your solution to remove punctuation and capital letters in case they haven't been removed yet. Also, this removes unnecessary whitespace, squeezes extra spaces, and prints the most frequent words first: `cat file.txt | tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn` (9 upvotes)
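
Putting that comment together with the word@count format the question asks for, a rough sketch (assuming the input file is file1.txt, as in the question):

# lowercase, strip punctuation, put one word per line, count, then reshape to word@count
tr '[:upper:]' '[:lower:]' < file1.txt | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}'

On the sample text this yields an alphabetically ordered list such as a@1, appear@2, ..., words@3.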

Boh*_*dan 43

uniq -c already does what you want; just sort the input first:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

Output:

  6 a
  7 d
  7 s

  • I would also suggest adding one more `sort -n` at the end of the line, so that your output is sorted from smallest to largest. (5 upvotes)
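
As that comment suggests, a further sort orders the output by count; a small sketch using -rn instead for descending order, plus awk to match the question's word@count format:

# order by frequency, most frequent first, then reshape to word@count
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | sort -rn | awk '{print $2"@"$1}'

This prints s@7, d@7, a@6 (the relative order of equal counts may vary).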

Jer*_*ews 12

You can use tr for this ('\12' is the octal escape for a newline); just run

tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt

Sample output for a text file of city names:

3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton


Ron*_*ony 7

Contents of the input file:

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic will count and ignore case, but the resulting list will have This instead of this.
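
A quick illustration of that caveat (a minimal sketch, not part of the original answer):

# case-variants collapse into one counted line, but the first spelling seen is kept
printf 'This\nthis\nTHIS\n' | uniq -ic
      3 This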


She*_*yar 5

Let's use AWK!

This function lists the frequency of each word occurring in the provided file, in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

  • One-liner: `cat file | awk '{for(i=1;i<=NF;++i){D[$i]++}}END{for(k in D)print k, D[k]}' | sort -k2nr | head -n 20` (2 upvotes)
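
For completeness, a sketch of my own (not taken from the answer) that folds the same normalization into the word@count output the question asks for; the if guard skips the empty field a trailing period leaves behind:

awk 'BEGIN { FS = "[^a-zA-Z]+" }                 # split fields on runs of non-letters
     { for (i = 1; i <= NF; i++)
           if ($i != "") words[tolower($i)]++ }  # lowercase and count, skipping empty fields
     END { for (w in words) print w "@" words[w] }' file1.txt | sort

Run against the sample file, this produces the word@count pairs the question asks for, in alphabetical order, including words@3.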