ukb*_*baz 2 scala scala-collections apache-spark apache-spark-sql
I have an RDD created from some JSON, and each record in the RDD contains key/value pairs. My RDD looks like this:
myRdd.foreach(println)
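I would like to convert each record into a row of a Spark DataFrame. The nested fields in trackingInfo should each get their own column, and the type list should also be its own column.
So far I have tried splitting it with a case class: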
case class Event(
sequence: String,
id: String,
trackingInfo:String,
location:String,
row:String,
trackId: String,
listrequestId: String,
videoId:String,
rank: String,
requestId: String,
`type`:String,
time: String)
val dataframeRdd = myRdd.map(line => line.split(",")).
map(array => Event(
array(0).split(":")(1),
array(1).split(":")(1),
array(2).split(":")(1),
array(3).split(":")(1),
array(4).split(":")(1),
array(5).split(":")(1),
array(6).split(":")(1),
array(7).split(":")(1),
array(8).split(":")(1),
array(9).split(":")(1),
array(10).split(":")(1),
array(11).split(":")(1)
))
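But I keep getting a java.lang.ArrayIndexOutOfBoundsException: 1 error.
What is the best way to do this? As you can see, record 5 has some attributes in a slightly different order. Is it possible to parse by attribute name instead of splitting on "," and so on?
I am using Spark 1.6.x.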
The records in your JSON RDD do not appear to be valid JSON. You need to convert them to valid JSON first:
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
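To illustrate what the two replace calls do, here is a small plain-Scala sketch. The malformed record below is hypothetical: it is not one of the records from the question and was chosen only so that both patterns match. The names RepairJsonSketch, raw and repair are illustrative as well.

```scala
object RepairJsonSketch {
  // Hypothetical malformed record (an assumption, not data from the question):
  // a stray ",1]" after the type list and a stray "]" after the closing brace.
  val raw: String = """{"id":1,"type":["Play","Action"],1],"time":5}]"""

  // The same two literal replacements as in the line above.
  def repair(s: String): String =
    s.replace(",1],", ",").replace("}]", "}")

  def main(args: Array[String]): Unit =
    println(repair(raw)) // {"id":1,"type":["Play","Action"],"time":5}
}
```

String.replace matches literal text rather than a regular expression, so both substitutions apply wherever the pattern occurs. In particular, replace("}]", "}") would also damage a legitimate array of objects, so this repair only suits records of this specific shape.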
Then you can use sqlContext to read the RDD of valid JSON strings into a DataFrame:
val df = sqlContext.read.json(validJsonRdd)
This should give you the following DataFrame (I used the invalid JSON you provided in the question):
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
The schema of the DataFrame is:
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
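The schema above keeps trackingInfo as a single struct column, while the question asked for each nested field as its own column. One way to get there is to select the nested fields by name. The following is a minimal sketch, assuming the schema shown above and the Spark 1.6 API (SQLContext); the object name FlattenTrackingInfo, the helper flatten and the value sampleRecord are hypothetical names, and the sample record is assembled from the first row of the output above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object FlattenTrackingInfo {
  // One valid JSON record, rebuilt from the first row of the output shown above.
  val sampleRecord: String =
    """{"sequence":89,"id":8697344444103393,""" +
    """"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,""" +
    """"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585",""" +
    """"videoId":80000778,"rank":0,""" +
    """"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},""" +
    """"type":["Play","Action","Session"],"time":527636408955}"""

  // Promote every trackingInfo field to a top-level column.
  // Fields are addressed by name, so their order inside a record does not matter.
  // `type` stays a single array<string> column.
  def flatten(df: DataFrame): DataFrame = df.select(
    col("sequence"), col("id"),
    col("trackingInfo.listId"), col("trackingInfo.location"),
    col("trackingInfo.row"), col("trackingInfo.trackId"),
    col("trackingInfo.videoId"), col("trackingInfo.rank"),
    col("trackingInfo.requestId"),
    col("type"), col("time"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flatten-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json(sc.parallelize(Seq(sampleRecord)))
    val flat = flatten(df)
    flat.printSchema()
    flat.show(false) // false = do not truncate long cell values

    sc.stop()
  }
}
```

In spark-shell, sc and sqlContext already exist, so only flatten and the call flatten(df) on the DataFrame read above would be needed.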
I hope the answer is helpful.