Recently I had a scenario where I store data as key-value pairs, and I came across the function reduceByKey(_ ++ _). It looks like shorthand syntax, and I can't work out what it actually means.
For example: reduceByKey(_ + _) means reduceByKey((a, b) => a + b).
So what does reduceByKey(_ ++ _) mean?
I was able to create key-value pairs from my data using reduceByKey(_ ++ _):
val y = sc.textFile("file:///root/My_Spark_learning/reduced.txt")

y.map(value => value.split(","))
 .map(value => (value(0), value(1), value(2)))
 .collect
 .foreach(println)
(1,2,3)
(1,3,4)
(4,5,6)
(7,8,9)
y.map(value => value.split(","))
 .map(value => (value(0), Seq(value(1), value(2))))
 .reduceByKey(_ ++ _)
 .collect
 .foreach(println)
(1,List(2, 3, 3, 4))
(4,List(5, 6))
(7,List(8, 9))
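For reference, `++` on Scala collections is concatenation, so the underscores expand exactly as in the `_ + _` case; a minimal plain-Scala sketch of what happens to the two Seqs that share key 1 above:

```scala
object Demo extends App {
  // reduceByKey(_ ++ _) desugars to reduceByKey((a, b) => a ++ b):
  // ++ concatenates two collections, so all Seqs sharing a key are appended together.
  val a = Seq("2", "3") // values for key 1 from the first line
  val b = Seq("3", "4") // values for key 1 from the second line
  val merged = a ++ b
  println(merged) // List(2, 3, 3, 4)
}
```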
I have nested JSON and want to produce output in a tabular structure. I can parse the JSON values individually, but I'm having some trouble arranging them into a table. I can do this easily with DataFrames, but I want to use "RDD-only" functionality. Any help is appreciated.
Input JSON:
{
  "level": {
    "productReference": {
      "prodID": "1234",
      "unitOfMeasure": "EA"
    },
    "states": [
      {
        "state": "SELL",
        "effectiveDateTime": "2015-10-09T00:55:23.6345Z",
        "stockQuantity": {
          "quantity": 1400.0,
          "stockKeepingLevel": "A"
        }
      },
      {
        "state": "HELD",
        "effectiveDateTime": "2015-10-09T00:55:23.6345Z",
        "stockQuantity": {
          "quantity": 800.0,
          "stockKeepingLevel": "B"
        }
      }
    ]
  }
}
Expected output:
I tried the Spark code below, but I get output like the following, and a Row() object cannot parse it:
079562193,EA,List(SELLABLE,HELD),List(2015-10-09T00:55:23.6345Z,2015-10-09T00:55:23.6345Z),List(1400.0,800.0),List(SINGLE,SINGLE)
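The column-oriented output above can be pivoted into one row per state by zipping the parallel lists; a minimal plain-Scala sketch (values copied from the sample JSON, variable names illustrative, not from the original code):

```scala
object PivotDemo extends App {
  // Per-key scalar fields and parallel "column lists", as in the output above.
  val prodID = "1234"
  val uom = "EA"
  val states = List("SELL", "HELD")
  val times = List("2015-10-09T00:55:23.6345Z", "2015-10-09T00:55:23.6345Z")
  val quantities = List(1400.0, 800.0)
  val levels = List("A", "B")

  // Zip the parallel lists element-wise, then flatten each nested tuple
  // into one flat row per state.
  val rows = states.zip(times).zip(quantities).zip(levels).map {
    case (((state, time), qty), level) => (prodID, uom, state, time, qty, level)
  }
  rows.foreach(println)
  // (1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A)
  // (1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B)
}
```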
def main(Args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", StringType, true),
    StructField("effectiveDateTime", StringType, true),
    StructField("quantity", StringType, true),
    StructField("stockKeepingLevel", StringType, true)
  )) …
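One way to get one flat row per state instead of column lists is to expand each parsed record with flatMap. A hedged sketch under the assumption that JSON parsing has already produced a typed structure (the case classes and names here are hypothetical, and the parsing step itself is elided); on an RDD the same expansion would be `rdd.flatMap(...)`:

```scala
object FlatMapSketch extends App {
  // Hypothetical typed view of one parsed JSON record.
  case class StockState(state: String, effectiveDateTime: String,
                        quantity: Double, stockKeepingLevel: String)
  case class Record(prodID: String, unitOfMeasure: String, states: Seq[StockState])

  val record = Record("1234", "EA", Seq(
    StockState("SELL", "2015-10-09T00:55:23.6345Z", 1400.0, "A"),
    StockState("HELD", "2015-10-09T00:55:23.6345Z", 800.0, "B")))

  // Emit one flat row per state, repeating the product-level fields.
  // With an RDD[Record] this body would sit inside rdd.flatMap(record => ...).
  val rows = record.states.map(s =>
    (record.prodID, record.unitOfMeasure, s.state,
     s.effectiveDateTime, s.quantity, s.stockKeepingLevel))
  rows.foreach(println)
  // (1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A)
  // (1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B)
}
```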