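I am trying to write a dissociated press algorithm based on n-grams in Scala. How can I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees".
Could you give me some hints on how to do this? Sorry for the inconvenience.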
per*_*i4n 13
Your question could be a bit more specific, but here is my attempt.
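// Split on spaces and print every pair of adjacent words (bigrams).
val words = "the bee is the bee of the bees"
// mkString(" ") keeps a space between the words; a bare mkString would print "thebee".
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))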
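One thing to watch with `sliding`: when the input has fewer than n words, it still yields a single, shorter window. Below is a minimal sketch of a helper that keeps only full-length windows. The name `exactNgrams` is made up for illustration and is not part of the original answer.

```scala
// Hypothetical helper: returns only windows that contain exactly n words.
def exactNgrams(tokens: Seq[String], n: Int): List[Seq[String]] = {
  require(n > 0, "n must be positive")
  // sliding(n) emits one short window when tokens.length < n, so filter it out.
  tokens.sliding(n).filter(_.length == n).toList
}

val tokens = "the bee is the bee of the bees".split(' ').toSeq
exactNgrams(tokens, 2).foreach(p => println(p.mkString(" ")))
```

With this filter, asking for bigrams of a one-word input gives an empty list rather than a single one-word "bigram".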
You can also try this with n as a parameter:
val words = "the bee is the bee of the bees"
val w = words.split(" ")
val n = 4
// Collect all 1-grams through n-grams, shortest first.
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams.foreach(println)

This prints:
List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
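The question mentions large files, which the snippets above do not cover directly. One possible approach, sketched here under assumptions, is to stay on iterators so the whole file is never held in memory. The helper name `ngramsFromLines` and the file name in the comment are hypothetical. Note that in this sketch n-grams run across line breaks.

```scala
import scala.io.Source

// Hypothetical helper: lazily yields word n-grams from an iterator of lines.
// Words are split on whitespace, and windows span line boundaries.
def ngramsFromLines(lines: Iterator[String], n: Int): Iterator[Seq[String]] = {
  require(n > 0, "n must be positive")
  lines
    .flatMap(_.split("\\s+").iterator) // one flat stream of words
    .filter(_.nonEmpty)                // drop empty tokens from blank lines
    .sliding(n)                        // lazy windows of up to n words
    .filter(_.length == n)             // drop a short window on tiny inputs
}

// In-memory demonstration with two "lines".
val demo = ngramsFromLines(Iterator("the bee is", "the bee of the bees"), 3)
demo.foreach(g => println(g.mkString(" ")))

// For a real file (the path is an assumption), something like:
//   val source = Source.fromFile("corpus.txt")
//   try ngramsFromLines(source.getLines(), 3).foreach(g => println(g.mkString(" ")))
//   finally source.close()
```

Because everything is an `Iterator`, only about n words need to be in memory at a time. If you want counts for the dissociated press step, you would fold the iterator into a map instead of printing.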