I have JSON data like the following:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
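This JSON records how long each application was active. The goal is to compute the total active time for each application.
I am using Spark SQL to parse the JSON.
Scala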
val sqlContext = sc.sqlContext
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query
appActiveTime.show(100100) // print DataFrame
appActiveTime.rdd.foreach(println) // print RDD
But the printed DataFrame looks like this:
+----------------------------------------------------------------------+
| data|
+----------------------------------------------------------------------+
| [[60000, com.browser1], [12870000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1207000, com.browser]]|
| [[120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser5]]|
| [[60000, com.browser1], [12075000, com.browser]]|
| [[60000, com.browser1], …
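For reference, the per-application total described above amounts to flattening the `data` array into one row per entry and then summing `activetime` per `package`. The following is only a minimal sketch under some assumptions: it uses a local `SparkSession` named `spark` and Spark 2.2 or later (for `spark.read.json` on a `Dataset[String]`), the object and function names `AppActiveTimeSketch` and `totalActiveTime` are made up for illustration, and the two input records are the sample lines from the top of the question rather than the real `behavior-json.log`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, sum}

// Hypothetical helper object; the names are not from the original post.
object AppActiveTimeSketch {

  // Flatten the `data` array (one row per array element),
  // then sum `activetime` for each `package`.
  def totalActiveTime(behavior: DataFrame): DataFrame =
    behavior
      .select(explode(col("data")).as("entry"))
      .groupBy(col("entry.package").as("package"))
      .agg(sum(col("entry.activetime")).as("total_activetime"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("app-active-time-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The two sample records shown in the question.
    val sample = Seq(
      """{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}""",
      """{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}"""
    )
    val behavior = spark.read.json(sample.toDS())

    // Print the aggregated rows without truncating column values.
    totalActiveTime(behavior).show(false)

    spark.stop()
  }
}
```

With those two sample records, the totals come out as 120000 for `com.browser1`, 2410000 for `com.browser6`, and 1205000 for `com.browser7` (row order is not guaranteed). On the real file, `sqlContext.read.json("behavior-json.log")` from the question could be passed to `totalActiveTime` in place of the sample data.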