Asked by Ankita · Tags: scala, tuples, apache-spark
How do I create tuples from the following existing RDD?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
The b.txt dataset:
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you intend to end up with an Array[Tuple4], you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field of the tuples:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
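Note that the REPL output above ends with `res4: Array[Unit]`: `map(println)` prints as a side effect but builds a useless array of `()`. A minimal sketch on a plain Scala collection (no Spark needed, since `collect` already returned an ordinary `Array`) shows the idiomatic alternatives:

```scala
// The collected result is an ordinary Array of tuples,
// so standard Scala collection methods apply.
val arrayTuples = Array(
  ("Ankita", "26", "BigData", "newbie"),
  ("Shikha", "30", "Management", "Expert")
)

// foreach is the idiomatic way to run a side effect such as println;
// map(println) would also print, but returns an Array[Unit].
arrayTuples.foreach(t => println(t._3))

// Extracting a field into a new collection is a job for map.
val skills = arrayTuples.map(_._3)
```

Here `skills` is `Array("BigData", "Management")`; use `map` when you want the values back, `foreach` when you only want the side effect.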
Update
If your input file has rows of varying length, for example:
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can write a match-case pattern match:
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
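The inferred type `Array[Product with Serializable]` is the warning sign here: the two case branches return a `Tuple4` and a `Tuple3`, and their only common supertype (in Scala 2) is `Product with Serializable`, on which you can no longer call `_1`, `_2`, etc. A minimal sketch on a plain Scala list, assuming we pad short rows with a placeholder `"Empty"`, shows one way to keep a single tuple arity:

```scala
val lines = List("Ankita,26,BigData,newbie", "Anita,26,big")

// Mixing tuple arities forces the compiler to infer the useless
// common supertype Product with Serializable (Scala 2):
val mixed = lines.map(_.split(",") match {
  case Array(a, b, c, d) => (a, b, c, d) // Tuple4
  case Array(a, b, c)    => (a, b, c)    // Tuple3
})
// mixed: List[Product with Serializable] -- mixed.head._1 does not compile.

// One fix: normalize every row to the same arity by padding to length 4.
val padded = lines.map { line =>
  val a = line.split(",").padTo(4, "Empty")
  (a(0), a(1), a(2), a(3))
}
// padded: List[(String, String, String, String)]
```

With `padded`, every element is a `(String, String, String, String)`, so field access works uniformly.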
Update again
As @eliasah pointed out, the above is bad practice: mixing tuple arities leaves you with the unusable Product with Serializable type. Following his suggestion, we should know the maximum number of elements in the input data and use the following logic, which assigns a default value to any missing element:
import scala.util.Try

val arrayTuples = rdd.map(line => line.split(",")).map(array =>
  (Try(array(0)).getOrElse("Empty"),
   Try(array(1)).getOrElse("Empty"),
   Try(array(2)).getOrElse("Empty"),
   Try(array(3)).getOrElse("Empty"))).collect
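An alternative sketch, assuming the same `"Empty"` placeholder: since an `Array` (via its `Seq` wrapper) is a partial function from index to element, `lift` turns out-of-bounds access into an `Option`, which reads a bit more directly than `Try`:

```scala
// A short row from the variable-length input.
val fields = "Anita,26,big".split(",")

// lift(i) returns Some(elem) if the index exists, None otherwise,
// so getOrElse supplies the default without catching exceptions.
val row = (
  fields.lift(0).getOrElse("Empty"),
  fields.lift(1).getOrElse("Empty"),
  fields.lift(2).getOrElse("Empty"),
  fields.lift(3).getOrElse("Empty")
)
// row == ("Anita", "26", "big", "Empty")
```

Both approaches yield a uniform `(String, String, String, String)`; `lift` just avoids the exception-based control flow inside `Try`.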
As @philantrovert pointed out, if we are not using the REPL, we can verify the output with:
arrayTuples.foreach(println)
The result is:
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)