How do I create a frequency list of every word in a file?

Vil*_*age 33 bash file-io grep sed

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I want to generate a two-column list. The first column shows the words that occur, and the second column shows how often they occur, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 
  • To make this job simpler, I will remove all punctuation and change all of the text to lowercase letters before processing the list.
  • Unless there is a simple solution, words and word can be counted as two separate words.

So far I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this just shows "0" after each word.

How can I generate a list of every word that appears in the file, along with its frequency?

edu*_*ffy 60

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

  • Well, just amending your solution to remove punctuation and capital letters in case they haven't been removed yet. Also, this removes unnecessary whitespace, squeezes extra spaces, and prints the most frequent words first: `cat file.txt | tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn` (9 upvotes)
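
Putting that comment together with the word@count format the question asks for, a rough sketch (assuming the input file is file1.txt, as in the question):

# lowercase, strip punctuation, put one word per line, count, then reshape to word@count
tr '[:upper:]' '[:lower:]' < file1.txt | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}'

On the sample text this yields an alphabetically ordered list such as a@1, appear@2, ..., words@3.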

Boh*_*dan 43

uniq -c already does what you want; just sort the input first:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

Output:

  6 a
  7 d
  7 s

  • I would also suggest adding one more `sort -n` at the end of the line, so that your output is sorted from smallest to largest. (5 upvotes)
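
As that comment suggests, a further sort orders the output by count; a small sketch using -rn instead for descending order, plus awk to match the question's word@count format:

# order by frequency, most frequent first, then reshape to word@count
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | sort -rn | awk '{print $2"@"$1}'

This prints s@7, d@7, a@6 (the relative order of equal counts may vary).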

Jer*_*ews 12

You can use tr for this ('\12' is the octal escape for a newline); just run

tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt

Sample output for a text file of city names:

3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton


Ron*_*ony 7

Contents of the input file:

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic will count and ignore case, but the resulting list will have This instead of this.
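
A quick illustration of that caveat (a minimal sketch, not part of the original answer):

# case-variants collapse into one counted line, but the first spelling seen is kept
printf 'This\nthis\nTHIS\n' | uniq -ic
      3 This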


She*_*yar 5

Let's use AWK!

This function lists the frequency of each word occurring in the provided file, in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

  • One-liner: `cat file | awk '{for(i=1;i<=NF;++i){D[$i]++}}END{for(k in D)print k, D[k]}' | sort -k2nr | head -n 20` (2 upvotes)
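
For completeness, a sketch of my own (not taken from the answer) that folds the same normalization into the word@count output the question asks for; the if guard skips the empty field a trailing period leaves behind:

awk 'BEGIN { FS = "[^a-zA-Z]+" }                 # split fields on runs of non-letters
     { for (i = 1; i <= NF; i++)
           if ($i != "") words[tolower($i)]++ }  # lowercase and count, skipping empty fields
     END { for (w in words) print w "@" words[w] }' file1.txt | sort

Run against the sample file, this produces the word@count pairs the question asks for, in alphabetical order, including words@3.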