Vil*_*age 33 bash file-io grep sed
I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I want to generate a two-column list. The first column shows which words appear, and the second column shows how often each one appears, for example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
The words "words" and "word" can count as two separate words. So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this only shows "0" after each word.
How can I generate a list of every word that appears in a file, along with its frequency?
edu*_*ffy 60
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
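The output above still counts "words" and "words." separately and keeps the original capitalization. To get closer to the list in the question, one possible variation splits on anything that is not a letter and lowercases everything before counting. This is only a sketch: the function name word_counts is made up for illustration, it assumes ASCII letters, and its output is sorted alphabetically rather than in order of first appearance.

```shell
# word_counts (hypothetical name): read text on stdin, print "word@count" lines.
word_counts() {
  tr -cs '[:alpha:]' '\n' |        # turn every run of non-letters into one newline
    tr '[:upper:]' '[:lower:]' |   # fold case so "Some" and "some" are counted together
    grep -v '^$' |                 # drop the empty line produced by leading non-letters
    sort |                         # uniq only merges adjacent lines, so sort first
    uniq -c |                      # prefix each distinct word with its count
    awk '{ print $2 "@" $1 }'      # reorder to word@count
}
```

It can be called as word_counts < file1.txt, with file1.txt being the file from the question.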
Boh*_*dan 43
uniq -c already does what you want; you just need to sort the input first:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
Output:
6 a
7 d
7 s
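The sort step matters because uniq only compares neighbouring lines. A small illustration, using a made-up three-line input, counts how many groups uniq -c reports with and without sorting:

```shell
# Same input both times: the lines a, b, a.
# Without sort, the two "a" lines are not adjacent, so uniq sees three groups.
unsorted="$(printf 'a\nb\na\n' | uniq -c | wc -l | tr -d ' ')"
# With sort, the duplicates become adjacent and collapse into one group.
sorted="$(printf 'a\nb\na\n' | sort | uniq -c | wc -l | tr -d ' ')"
echo "groups without sort: $unsorted, with sort: $sorted"
```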
Jer*_*ews 12
You can use tr for this; just run
tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt
Sample output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
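Because sort -nr puts the highest counts first, adding head is enough to keep only the most frequent entries. The sketch below wraps that idea in a function; the name top_words and the default of 10 are assumptions for illustration, not part of the command above. It also squeezes repeated whitespace and breaks ties by token so the ordering is predictable.

```shell
# top_words N (hypothetical name): read text on stdin and print the N most
# frequent whitespace-separated tokens as "count token", highest count first.
top_words() {
  tr -s '[:space:]' '\n' |   # one token per line, runs of whitespace squeezed
    grep -v '^$' |           # drop an empty first line caused by leading whitespace
    sort |
    uniq -c |
    sort -k1,1nr -k2 |       # count descending, then token ascending for ties
    head -n "${1:-10}"       # keep the first N lines (10 if no argument is given)
}
```

For example, top_words 5 < NAME_OF_FILE would print the five most frequent tokens.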
Contents of the input file
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic will count while ignoring case, but the resulting list will contain This instead of this.
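One thing to watch with uniq -ic is that it still only merges adjacent lines, so the sort before it should ignore case as well. In the C locale, a plain sort places all uppercase letters before lowercase ones, which can leave "Some" and "some" far apart. A minimal sketch, assuming a sort that supports -f and a uniq that supports -i (GNU and BSD versions do); the function name is hypothetical:

```shell
# count_ignore_case (hypothetical name): read one word per line on stdin and
# count words case-insensitively, keeping the first spelling seen in each group.
count_ignore_case() {
  sort -f |    # fold case while sorting so "Some" and "some" end up adjacent
    uniq -ic   # count adjacent lines, comparing them case-insensitively
}
```

It expects one word per line, for example the output of tr ' ' '\n' on the input file.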
This function reads text from standard input and lists each word with its frequency, in descending order of frequency:
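function wordfrequency() {
  awk '
    BEGIN { FS = "[^a-zA-Z]+" }   # any run of non-letters separates words
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "") continue    # skip empty fields left by leading or trailing punctuation
        word = tolower($i)
        words[word]++
      }
    }
    END {
      for (w in words)
        printf("%3d %s\n", words[w], w)
    }
  ' | sort -rn
}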
You can call it on your file like this:
$ cat your_file.txt | wordfrequency
Source: AWK-ward Ruby