The 100 most common strings in a file

Question

Dyn*_*mic 1 sorting perl hash file cpu-word

How can I find the 100 most frequently used strings (words) in a .txt file using Perl? So far I have the following:

use 5.012;
use warnings;

open(my $file, "<", "file.txt") or die "Cannot open file.txt: $!";

my %word_count;
while (my $line = <$file>) {
  foreach my $word (split ' ', $line) {
     $word_count{$word}++;
  } 
} 

for my $word (sort keys %word_count) {
  print "'$word': $word_count{$word}\n";
}

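But this only counts each word and lists the words in alphabetical order. I want the 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?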

Related: Count the number of times a string is repeated in a file (Perl)

Answer 1

tch*_*ist 8

Reading the fine perlfaq4(1) manpage shows how to sort a hash by value. So try this. It is more idiomatically "perlian" than your approach.

#!/usr/bin/env perl    
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);

my %seen;
while (<>) {
    $seen{$_}++ for split /\W+/;  # or just split;
}

my $count = 0;
for (sort {
        $seen{$b} <=> $seen{$a}
                  ||
           lc($a) cmp lc($b)    # XXX: should be v5.16's fc() instead
                  ||
              $a  cmp  $b
     } keys %seen)
{
    next unless /\w/;
    printf "%-20s %5d\n", $_, $seen{$_};
    last if ++$count >= 100;
}
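
The sort-by-value step can also be pulled out into a small helper that returns only the first N keys, which maps directly onto the print loop in the question. This is a minimal sketch, not part of the original answer: the helper name `top_words` and the sample data are hypothetical, and it assumes the counts are already in a hash built as above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): return up to $n
# keys of %$counts, most frequent first, ties broken by string order.
sub top_words {
    my ($counts, $n) = @_;
    my @sorted = sort {
        $counts->{$b} <=> $counts->{$a}   # higher count first
                      ||
                 $a cmp $b                # then plain string order
    } keys %$counts;
    $n = @sorted if $n > @sorted;         # fewer than $n distinct words
    return @sorted[0 .. $n - 1];
}

# Illustrative data only.
my %word_count = (the => 7, perl => 3, hash => 3, file => 1);
printf "%-20s %5d\n", $_, $word_count{$_} for top_words(\%word_count, 3);
```

With this sample data the three rows printed are `the`, `hash`, and `perl`, in that order. In the question's script, `top_words(\%word_count, 100)` could replace `sort keys %word_count` in the final loop.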

When run on itself, the first 10 lines of output are:

seen                     6
use                      5
_                        3
a                        3
b                        3
cmp                      2
count                    2
for                      2
lc                       2
my                       2
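
The `_`, `a`, and `b` rows appear because `split /\W+/` cuts `$_`, `$a`, and `$b` at the sigil and keeps the bare names as words. The sketch below compares the two split forms mentioned in the script's comment on a made-up sample line; the helper names `words_nonword` and `words_space` are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers for comparison; names are not from the answer.

# Split on runs of non-word characters and drop the empty leading field
# that appears when the line starts with a non-word character.
sub words_nonword { return grep { length($_) } split /\W+/, $_[0] }

# Split on whitespace only (what a bare `split` does on $_);
# punctuation and sigils stay attached to each token.
sub words_space   { return split ' ', $_[0] }

my $line = '$seen{$_}++ for split;';   # made-up sample input
print join('|', words_nonword($line)), "\n";
print join('|', words_space($line)),   "\n";
```

For this sample line the first form yields `seen|_|for|split` and the second `$seen{$_}++|for|split;`, so the choice depends on whether punctuation should count as part of a word.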

Asked:	13 years, 10 months ago
Viewed:	668 times
Last active:	13 years, 3 months ago