Dra*_*ick 2 json scala apache-spark apache-spark-sql
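I have a dataset that arrives as XML, and one of its nodes contains JSON. Spark reads that node as a StringType, so I am trying to use from_json() to convert the JSON into a DataFrame.
I can convert a single JSON string, but how do I write the schema so that it works with an array?
String without an array - works fine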
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // StructType, StringType, ArrayType
val schemaExample = new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")
val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))
dfICanWorkWith.collect()
// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
String with an array - can't figure this one out
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // StructType, StringType, ArrayType
val schemaExample2 = new StructType()
.add("", ArrayType(new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
)
)
val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))
dfICanWorkWith.collect()
// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])
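The problem is that your JSON is not fully qualified, so it does not match the schema. The schema describes an object with a field named "" that holds the array, but the string is a bare array, and it is also missing its closing ].
Try replacing it with: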
val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")
You will get:
scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
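As a possible alternative to wrapping the array in an object, here is a minimal sketch that passes an ArrayType schema straight to from_json. It assumes Spark 2.2 or later, where from_json has an overload accepting a DataType. The names personSchema, arraySchema, and people are illustrative and not from the question, and the local SparkSession is created only so the snippet stands alone (in spark-shell a session already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("from-json-array")
  .getOrCreate()
import spark.implicits._

// Schema of one array element (same fields as in the question).
val personSchema = new StructType()
  .add("FirstName", StringType)
  .add("Surname", StringType)

// Top-level array of those structs; no wrapper object with an empty field name.
val arraySchema = ArrayType(personSchema)

// Well-formed JSON array, including the closing bracket.
val json = """[{"FirstName":"Johnny","Surname":"Boy"},{"FirstName":"Franky","Surname":"Man"}]"""
val df = Seq(json).toDF("theJson")

// Parse the string into an array of structs, then explode to one row per element.
val people = df
  .select(explode(from_json(col("theJson"), arraySchema)).as("person"))
  .select("person.FirstName", "person.Surname")

people.show()
```

Exploding the parsed array yields one row per person with FirstName and Surname as ordinary columns, which may be easier to work with than a single row holding a WrappedArray.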