如何在Scala中对Python或运算符进行近似设置比较?

Ste*_*ill 6 python scala anonymous-function

在听到最新的Stack Overflow播客后,Peter Norvig的紧凑型Python拼写检查器引起了我的兴趣,所以我决定在Scala中实现它,如果我能用功能Scala成语表达它,还要看看需要多少行代码.

这是整个问题.(我们还不比较代码行.)

(两个注意事项:如果你愿意,你可以在Scala解释器中运行它.如果你需要big.txt或整个项目的副本,它就在GitHub上.)

import scala.io.Source

val alphabet = "abcdefghijklmnopqrstuvwxyz"

def train(text:String) = {
  "[a-z]+".r.findAllIn(text).foldLeft(Map[String, Int]() withDefaultValue 1)
    {(a, b) => a(b) = a(b) + 1}
}

val NWORDS = train(Source.fromFile("big.txt").getLines.mkString.toLowerCase)

def known(words:Set[String]) =
  {Set.empty ++ (for(w <- words if NWORDS contains w) yield w)}

def edits1(word:String) = {
  Set.empty ++
  (for (i <- 0 until word.length) // Deletes
    yield (word take i) + (word drop (i + 1))) ++ 
  (for (i <- 0 until word.length - 1) // Transposes
    yield (word take i) + word(i + 1) + word(i) + (word drop (i + 2))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Replaces
    yield (word take i) + j + (word drop (i+1))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Inserts
    yield (word take i) + j + (word drop i))
}

def known_edits2(word:String) = {Set.empty ++ (for (e1 <- edits1(word);
  e2 <- edits1(e1) if NWORDS contains e2) yield e2)}

def correct(word:String) = {
  val options = Seq(() => known(Set(word)), () => known(edits1(word)),
    () => known_edits2(word), () => Set(word))
  val candidates = options.foldLeft(Set[String]())
    {(a, b) => if (a.isEmpty) b() else a}
  candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}
Run Code Online (Sandbox Code Playgroud)

具体来说,我想知道我能用这个correct功能做些什么更清洁.在原始Python中,实现更清晰:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or
      known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Run Code Online (Sandbox Code Playgroud)

显然在Python中,空集将评估为布尔值False,因此只会评估返回非空集的第一个候选项,从而节省对edits1和的潜在昂贵的调用known_edits2.

我想出的唯一解决方案是你在这里看到的版本,其中Seq匿名函数被调用,直到一个返回非空Set,最后一个保证做.

经验丰富的Scala主管,是否有更复杂的语法或更好的方法呢?提前致谢!

Dan*_*wak 6

我不确定你为什么试图使用延迟评估known而不是简单地使用流来表示oxbow_lakes.做他做的更好的方式:

def correct(word: String) = {
  import Stream._

  val str = cons(known(Set(word)), 
            cons(known(edits1(word)),
            cons(known_edits2(word),
            cons(Set(word), empty))))

  str find { !_.isEmpty } match {
    case Some(candidates) =>
      candidates.foldLeft(Set[String]()) { (res, n) =>
        if (NWORDS(res) > NWORDS(n)) res else n
      }

    case None => Set()
  }
}
Run Code Online (Sandbox Code Playgroud)

利用Stream.cons已经懒惰的事实,因此我们不需要将所有内容包装在一个thunk中.

如果你真的想要很好的语法,我们可以为所有这些方法添加一些语法糖:

implicit def streamSyntax[A](tail: =>Stream[A]) = new {
  def #::(hd: A) = Stream.cons(hd, tail)
}
Run Code Online (Sandbox Code Playgroud)

现在我们以前的丑陋str定义如下:

def correct(word: String) = {
  val str = known(Set(word)) #:: known(edits1(word)) #::
            known_edits2(word) #:: Set(word) #:: Stream.empty

  ...
}
Run Code Online (Sandbox Code Playgroud)


Dav*_*low 2

迭代器也是惰性的(尽管不是很实用,因为你只能迭代它们一次。)所以,你可以这样做:

  def correct(word: String) = {
    val sets = List[String => Set[String]](
      x => known(Set(x)), x => known(edits1(x)), known_edits2
    ).elements.map(_(word))

    sets find { !_.isEmpty } match {
      case Some(candidates: Set[String]) => candidates.reduceLeft { (res, n) => if (NWORDS(res) > NWORDS(n)) res else n }
      case None => word
    }
  }
Run Code Online (Sandbox Code Playgroud)

作为奖励,Iterator 的 find() 方法不会强制计算下一个元素。