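Because of how Parquet handles empty arrays, I replaced empty arrays with null before writing the table. Now, when I read the table back, I want to do the opposite:
I have a DataFrame with the following schema: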
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- x: double (nullable = true)
|    |    |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|       null|
+---+-----------+
I want to replace the null array (id = 2) with an empty array, i.e.
+---+-----------+
| id|        arr|
+---+-----------+
|  1|[[1.0,2.0]]|
|  2|         []|
+---+-----------+
I tried:
val arrSchema = df.schema(1).dataType
df
  .withColumn("arr", when($"arr".isNull, array().cast(arrSchema)).otherwise($"arr"))
  .show()
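This gives:
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.StructType
Edit: I don't want to "hard-code" any schema for my array column (at least not the schema of the struct), because it can vary from case to case. I can only use the schema information from df at runtime.
By the way, I'm using Spark 2.1, so I cannot use …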
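One possible direction, sketched here under assumptions and not taken from the question itself: read the column's array type from df at runtime and pass it as the declared return type of an untyped UDF that returns an empty sequence, then fall back to that UDF with coalesce. The names NullArrays and nullToEmpty are hypothetical. The sketch relies on the udf(f, dataType) overload available in Spark 2.x; on Spark 3.x that overload is deprecated and, as far as I know, rejected unless a legacy setting is enabled.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{coalesce, col, udf}

// Hypothetical helper, not part of Spark.
object NullArrays {

  /** Replace null values of the array column `colName` with an empty array of the same type. */
  def nullToEmpty(df: DataFrame, colName: String): DataFrame = {
    // The full array type (including the struct element type) comes from the
    // DataFrame at runtime, so no schema is hard-coded here.
    val arrType = df.schema(colName).dataType

    // Explicit Function0 type so that the untyped udf(f: AnyRef, dataType) overload is chosen.
    val makeEmpty: () => Seq[Row] = () => Seq.empty[Row]
    val emptyArr = udf(makeEmpty, arrType)

    // Keep the existing value when it is not null, otherwise use the empty array.
    df.withColumn(colName, coalesce(col(colName), emptyArr()))
  }
}
```

Usage would look like NullArrays.nullToEmpty(df, "arr").show(). Because the UDF's return type is the column's own type, coalesce sees two expressions of the same type and no cast from NullType is involved.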
Suppose I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema follows from the case classes above: subClasss is an array of structs with the fields id (string), size (integer), and useless (string).
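I'm looking for a way to select only a subset of the fields, id and size, of the array column subClasss, while keeping the nested array structure. The resulting schema would be: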
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer …
I'm trying to change the names of DataFrame columns in Scala. I can easily rename the top-level fields, but I'm having trouble converting the array and struct columns.
Below is my DataFrame schema.
|-- _VkjLmnVop: string (nullable = true)
|-- _KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
|    |-- UvwXyz: struct (nullable = true)
|    |    |-- _MnoPqrstUv: string (nullable = true)
|    |    |-- _ManDevyIxyz: string (nullable = true)
But I need the schema shown below:
|-- vkj_lmn_vop: string (nullable = true)
|-- ka_tas_lop: string (nullable = true)
|-- abc_def: struct (nullable = true)
|    |-- uvw_xyz: struct (nullable = true)
|    |    |-- mno_pqrst_uv: string (nullable = true)
|    |    |-- …
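A minimal sketch of one way the renaming above might be approached, assuming the rule is "drop a leading underscore, put an underscore at every lower-to-upper boundary, then lowercase" (inferred from the two schemas, not stated in the question). The object SnakeCaseColumns and its methods are hypothetical names. The idea is to rebuild each column's data type with converted field names and cast the column to it, since a struct can be cast to another struct with the same field types and different names. Map types are not handled here.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Hypothetical helper, not part of Spark.
object SnakeCaseColumns {

  /** Assumed naming rule: "_VkjLmnVop" becomes "vkj_lmn_vop". */
  def toSnakeCase(name: String): String =
    name.stripPrefix("_").replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase

  /** Rebuild a data type with converted field names, descending into structs and arrays. */
  def renameFields(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.map { f =>
        f.copy(name = toSnakeCase(f.name), dataType = renameFields(f.dataType))
      })
    case a: ArrayType =>
      a.copy(elementType = renameFields(a.elementType))
    case other =>
      other // leaf types (string, long, ...) carry no field names
  }

  /** Cast every top-level column to its renamed type and alias it with the converted name. */
  def snakeCaseAll(df: DataFrame): DataFrame = {
    val renamed = df.schema.fields.map { f =>
      col(f.name).cast(renameFields(f.dataType)).as(toSnakeCase(f.name))
    }
    df.select(renamed: _*)
  }
}
```

Calling SnakeCaseColumns.snakeCaseAll(df).printSchema() would then show the converted names at every nesting level. Only names change in the cast, so the data itself is left as it was.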