Reading a file in Scala with Spark

bou*_*sse 1 scala apache-spark

I'm trying to read a file that looks like this:

you 0.0432052044116
i 0.0391075831328
the 0.0328010698268
to 0.0237549924919
a 0.0209682886489
it 0.0198104294359

I want to store it in an RDD of (key, value) pairs, e.g. (you, 0.0432). So far I've only got this:

import java.io.{FileNotFoundException, IOException}
import scala.io.Source

val filename = "freq2.txt"
try {
  for (line <- Source.fromFile(filename).getLines()) {
    val tuple = line.split(" ")
    val key = tuple(0)
    val words = tuple(1)
    println(s"${key}")
    println(s"${words}")
  }
} catch {
  case ex: FileNotFoundException => println("Couldn't find that file.")
  case ex: IOException => println("Had an IOException trying to read that file")
}

But I don't know how to store the data...

axi*_*iom 6

You can read the data directly into an RDD:

val FIELD_SEP = " " // or whatever separator your file uses

// Each line becomes a (word, score) pair; any trailing fields land in `other` and are ignored.
val dataset = sparkContext.textFile(sourceFile).map(line => {
    val word :: score :: other = line.split(FIELD_SEP).toList
    (word, score)
})
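Note that the map above keeps the score as a String. Here is a minimal sketch of the same split-and-pair logic with the score converted to Double, run on a plain Scala collection so it works without a Spark cluster (the sample lines are taken from the question's file; inside Spark you would apply the same function in the `.map` call):

```scala
// Sample lines in the question's "word score" format.
val lines = Seq(
  "you 0.0432052044116",
  "i 0.0391075831328",
  "the 0.0328010698268"
)

// Same parsing as the RDD version, plus a toDouble conversion on the score.
val pairs = lines.map { line =>
  val fields = line.split(" ")
  (fields(0), fields(1).toDouble)
}

pairs.foreach(println)
```

With the values as Double you can then aggregate numerically, e.g. with `reduceByKey(_ + _)` on the RDD.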