Ruby, Count Syllables

Rai*_*Son 8 ruby nlp

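I'm using Ruby to calculate the Gunning Fog Index of some content I have, and I can successfully implement the algorithm described here:

Gunning Fog Index

I'm using the following method to count the number of syllables in each word: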

Tokenizer = /([aeiouy]{1,3})/

def count_syllables(word)

  len = 0

  if word[-3..-1] == 'ing' then
    len += 1
    word = word[0...-3]
  end

  got = word.scan(Tokenizer)
  len += got.size()

  if got.size() > 1 and got[-1] == ['e'] and
      word[-1].chr() == 'e' and
      word[-2].chr() != 'l' then
    len -= 1
  end

  return len

end

It sometimes counts words that have only 2 syllables as having 3. Can anyone offer any advice, or does anyone know of a better method?

text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."

# used to get rid of any punctuation
# (gsub rather than gsub!, because gsub! returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')

word_array = text.split(' ')

word_array.each do |word|
    puts word if count_syllables(word) > 2
end

"themselves" is counted as 3 syllables, but it has only 2.

Pes*_*sto 11

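The function I gave you before is based on these simple rules outlined here:

Each vowel (a, e, i, o, u, y) in a word counts as one syllable, subject to the following sub-rules:

  • Ignore final -ES, -ED, -E (except for -LE)
  • Words of three letters or fewer count as one syllable
  • Consecutive vowels count as one syllable.

Here's the code: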

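def new_count(word)
  word.downcase!                   # note: mutates the argument, so pass a copy
  return 1 if word.length <= 3     # short words count as one syllable
  # drop a final -ed, and a final -es or -e unless it follows 'l' or a vowel
  word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word.sub!(/^y/, '')              # a leading 'y' is not counted as a vowel
  word.scan(/[aeiouy]{1,2}/).size  # each run of one or two vowels is a syllable
end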

Obviously, this isn't perfect either, but anything you can get for a problem like this is a heuristic.

EDIT:

I changed the code slightly to handle a leading 'y', and fixed the regex to handle 'les' endings better (such as in "candles").
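
Because new_count calls downcase! and sub! on its argument, callers have to pass a copy, which is why the comparison below uses word.dup. As a minimal sketch, here is a non-mutating variant of the same heuristic with a few spot checks for the 'y' and 'les' cases. The name estimate_syllables is hypothetical and not part of the original answer:

```ruby
# Hypothetical non-mutating variant of new_count (same regexes, same rules).
def estimate_syllables(word)
  w = word.downcase
  return 1 if w.length <= 3                          # short words count as one syllable
  w = w.sub(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')  # drop silent endings, keep -le/-les
  w = w.sub(/^y/, '')                                # a leading 'y' is not counted as a vowel
  w.scan(/[aeiouy]{1,2}/).size                       # each run of 1-2 vowels is one syllable
end

# Spot checks: prints one "word: count" line per word.
%w[candles young themselves logorrhoea].each do |word|
  puts "#{word}: #{estimate_syllables(word)}"
end
```

With these rules, "candles" and "themselves" come out as 2, "young" as 1, and "logorrhoea" as 4.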

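Here's a comparison using the text in the question:

# used to get rid of any punctuation
# (gsub rather than gsub!, because gsub! returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')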

words = text.split(' ')

words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end

The output is:

logorrhoea:     3   4
used:   2   1
makes:  2   1
themselves:     3   2

So it appears to be an improvement.
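
For context, here is a rough sketch of how a syllable counter like this could feed into the Gunning Fog Index, using the usual formula 0.4 * (words / sentences + 100 * complex words / words). The name gunning_fog, the sentence splitting on ., ! and ?, and the choice to treat every word of three or more syllables as complex (with no exceptions for proper nouns or common suffixes) are all simplifying assumptions, not part of the question or the answer:

```ruby
# Sketch of the Gunning Fog Index:
#   0.4 * (words / sentences + 100 * complex_words / words)
# The block receives each word and must return its estimated syllable count.
def gunning_fog(text)
  # Naive sentence split on runs of '.', '!' or '?' (an assumption).
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words = text.scan(/[A-Za-z']+/)
  return 0.0 if sentences.empty? || words.empty?

  # "Complex" here simply means three or more syllables.
  complex = words.count { |word| yield(word) >= 3 }
  0.4 * (words.size.to_f / sentences.size + 100.0 * complex / words.size)
end

# Example, assuming new_count from the answer above is defined:
#   gunning_fog(text) { |word| new_count(word.dup) }
```

Passing the counter as a block keeps the index calculation independent of whichever syllable heuristic is in use, so the two functions can be swapped and compared directly.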