Tags: string, scala, join, apache-spark, rdd
I have two RDDs:

rdd1 [String, String, String]: Name, Address, Zipcode
rdd2 [String, String, String]: Name, Address, Landmark
I am trying to join these two RDDs using rdd1.join(rdd2), but I get an error:

 error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]

The join should join the two RDD[String]s, and the output RDD should look like this:

rddOutput: Name, Address, Zipcode, Landmark
I eventually want to save the result as a JSON file. Can someone help me with this?
As mentioned in the comments, you have to convert your RDDs into PairRDDs before joining, meaning each RDD must be of type RDD[(key, value)]. Only then can you perform a join by key. In your case, the key consists of (name, address), so you would have to do something like this:
// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }
// Now we can join them.
// fullOuterJoin yields ((name, address), (Option[zipcode], Option[landmark])),
// because either side may be missing a given key. Here we unwrap the Options,
// falling back to an empty string when one side has no match:
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcode, landmark)) =>
    (name, address, zipcode.getOrElse(""), landmark.getOrElse(""))
}
For more information on PairRDD functions, see Spark's Scala API documentation.
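Since you also want to write the result out as JSON, here is a minimal sketch of one way to do that, assuming a SparkContext is already in scope and `joined` is the RDD produced above. It formats each tuple as a JSON object by hand and writes one object per line with saveAsTextFile; the output path `output/joined-json` is just a placeholder. For anything beyond a quick export, converting to a DataFrame and using Spark SQL's JSON writer (or a proper JSON library, to handle escaping) would be more robust.

```scala
// Sketch: serialize each joined record as a JSON line and save it.
// Note: naive string interpolation does not escape quotes inside values.
val asJson = joined.map { case (name, address, zipcode, landmark) =>
  s"""{"name":"$name","address":"$address","zipcode":"$zipcode","landmark":"$landmark"}"""
}
// Writes a directory of part files, one JSON object per line (JSON Lines format).
asJson.saveAsTextFile("output/joined-json")
```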