Counting the frequency of a given word in a text file in Ruby

Xib*_*ion 0 ruby cpu-word find-occurrences

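I want to count the occurrences of a given word (for example, one supplied as input) in a text file. I have this code, which gives me the occurrences of every word in the file: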

word_count = {}
my_word = id  # the word to look for; not yet used by the loop below
File.open("texte.txt", "r") do |f|
  f.each_line do |line|
    line.split(' ').each do |word|
      # Test the current word, not my_word, so every word is tallied.
      if word_count.has_key?(word)
        word_count[word] += 1
      else
        word_count[word] = 1
      end
    end
  end
end

puts "\n"+ word_count.to_s

Thank you.

Car*_*and 5

Create a test file

Let's first create a file to work with.

text =<<-BITTER_END
It was the best of times, it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief, it was the epoch of
incredulity, it was the season of Light, it was the season of Darkness, it was
the spring of hope, it was the winter of despair, we had everything before us,
we had nothing before us...
BITTER_END

FName = 'texte.txt'
File.write(FName, text)
  #=> 344

Specify the word to be counted

target = 'the'

Create a regular expression

r = /\b#{target}\b/i
  #=> /\bthe\b/i

The word breaks (\b) ensure, for example, that 'anthem' is not counted as 'the'.
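As a quick check of the word-break behaviour, the sketch below runs the pattern against a few sample strings. The sample strings are made up for illustration, and the call to Regexp.escape is an added precaution that is not in the original answer: it keeps a target containing characters such as '.' or '+' from being read as regex syntax.

```ruby
target = 'the'

# Regexp.escape is an addition (an assumption, not part of the original
# answer); for a plain word such as 'the' it changes nothing.
r = /\b#{Regexp.escape(target)}\b/i

# Hypothetical sample strings: only whole-word occurrences should match.
samples = ['the', 'The end', 'anthem', 'other', 'bathe the cat']
counts = samples.map { |s| [s, s.scan(r).count] }.to_h

counts.each { |s, n| puts "#{s.inspect}: #{n}" }
```

Only the standalone occurrences are counted; 'anthem', 'other' and 'bathe' contribute nothing because 'the' is embedded in a longer word there.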

Gulp small files

If, as here, the file is not very large, you can gulp it (read it into memory all at once):

File.read("texte.txt").scan(r).count
  #=> 10

Read large files line by line

If the file is so large that we would rather read it line by line, do the following.

File.foreach(FName).reduce(0) { |cnt, line| cnt + line.scan(r).count }
  #=> 10

or

File.foreach(FName).sum { |line| line.scan(r).count }
  #=> 10

Note that Enumerable#sum made its debut in Ruby v2.4.

See IO::read and IO::foreach. (IO.methodx... is commonly written File.methodx.... This is permitted because File is a subclass of IO; that is, File < IO #=> true.)
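The subclass relationship mentioned above can be confirmed directly in irb. This is only a small sketch; the variable names are illustrative.

```ruby
# File inherits from IO, so class methods defined on IO (read, foreach, ...)
# can also be called on File.
is_subclass = File < IO
parent      = File.superclass
has_foreach = File.respond_to?(:foreach)

puts is_subclass
puts parent
puts has_foreach
```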

Use gsub to avoid creating a temporary array

The first method (gulping the file) creates a temporary array:

["the", "the", "the", "the", "the", "the", "the", "the", "the", "the"]

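to which count (a.k.a. size) is applied. One way to avoid creating this array is to use String#gsub rather than String#scan, since the former, when called with a pattern but no replacement and no block, returns an enumerator: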

File.read("texte.txt").gsub(r).count
  #=> 10

This could also be used for each line of the file.

This is an unconventional, but sometimes useful, use of gsub.
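The per-line variant mentioned above might look like the following. This is a sketch only: the sample text and the temporary file are made up so that the snippet runs on its own, and Enumerable#sum requires Ruby 2.4 or later.

```ruby
require 'tempfile'

# Hypothetical sample text, written to a temporary file so the sketch is
# self-contained.
sample = "The cat saw the dog.\nAnother theme, then the end.\n"
r = /\bthe\b/i

total = Tempfile.create('sample') do |f|
  f.write(sample)
  f.flush
  # gsub with a pattern and no block returns an enumerator over the matches,
  # so no per-line array of matches is built before counting.
  File.foreach(f.path).sum { |line| line.gsub(r).count }
end

puts total
```

Here 'Another', 'theme' and 'then' are not counted, since 'the' appears in them only as part of a longer word.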