jmv*_*llt 5 scala dataframe apache-spark apache-spark-sql
Suppose I have a DataFrame like this:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
 |    |    |-- useless: string (nullable = true)
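I'm looking for a way to select only a subset of the fields, id and size, of the array column subClasss, while keeping the nested array structure. The resulting schema would be: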
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
I tried:
df.select("subClasss.id","subClasss.size")
But this splits the array subClasss into two arrays:
root
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: integer (containsNull = true)
Is there a way to keep the original structure and just drop the useless field? Something like:
df.select("subClasss.[id,size]")
Thanks for your time.
Spark >= 2.4:
You can use arrays_zip with a cast:
import org.apache.spark.sql.functions.arrays_zip
df.select(arrays_zip(
$"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))
The cast is needed to rename the nested fields; without it, Spark uses automatically generated names 0, 1, ... n.
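To check that the zip-and-cast approach really yields a single array of two-field structs, here is a minimal end-to-end sketch. The local SparkSession, the object name ArraysZipSketch, the helper names trim and elementFields, and the alias back to subClasss are all assumptions added for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Top-level case classes so Spark can derive a schema for them.
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

object ArraysZipSketch {
  // Hypothetical helper: zip the two extracted arrays back into one array of
  // structs, rename the struct fields via the cast, and keep the column name.
  def trim(df: DataFrame): DataFrame =
    df.select(
      arrays_zip(col("subClasss.id"), col("subClasss.size"))
        .cast("array<struct<id:string,size:int>>")
        .alias("subClasss")
    )

  // Hypothetical helper: field names of the struct inside the array column.
  def elementFields(df: DataFrame): Seq[String] =
    df.schema("subClasss").dataType
      .asInstanceOf[ArrayType].elementType
      .asInstanceOf[StructType].fieldNames.toSeq

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[1]").appName("arrays-zip-sketch").getOrCreate()

    val df = spark.createDataFrame(List(
      MotherClass(Array(SubClass("1", 1, "thisIsUseless"), SubClass("2", 2, "thisIsUseless"))),
      MotherClass(Array(SubClass("4", 4, "thisIsUseless")))
    ))

    val trimmed = trim(df)
    trimmed.printSchema()
    println(elementFields(trimmed))
    spark.stop()
  }
}
```

Because the target type in the cast is written as a DDL string, the nullability flags in the resulting schema may differ from those of the original column.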
Spark < 2.4:
You can use a UDF like this:
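import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Target element type: the struct without the useless field.
case class Record(id: String, size: Int)

// Map every struct in the array to a Record, discarding the third field.
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))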
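One caveat with the pattern match in the UDF: a typed pattern such as id: String does not match null, so a row whose id is null would raise a MatchError. The following is a sketch of a variant that reads the fields by position instead. The object name NullTolerantUdfSketch, the helper trim, the local SparkSession and the alias are assumptions for illustration, and the sketch still expects the array itself and the size field to be non-null.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])
// Target element type: the struct without the useless field.
case class Record(id: String, size: Int)

object NullTolerantUdfSketch {
  // Positional access: getString returns null for a null id instead of
  // failing a pattern match. Field order (id, size, useless) is assumed.
  val dropUseless = udf((xs: Seq[Row]) =>
    xs.map(r => Record(r.getString(0), r.getInt(1)))
  )

  // Hypothetical helper: apply the UDF and keep the original column name.
  def trim(df: DataFrame): DataFrame =
    df.select(dropUseless(col("subClasss")).alias("subClasss"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[1]").appName("udf-sketch").getOrCreate()

    val df = spark.createDataFrame(List(
      MotherClass(Array(SubClass(null, 7, "thisIsUseless"), SubClass("2", 2, "thisIsUseless")))
    ))

    trim(df).printSchema()
    trim(df).show(false)
    spark.stop()
  }
}
```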