Ébe*_*aac 8 scala apache-spark
我想迭代 Spark 中的模式。使用给出嵌套和df.schema
的列表。StructType
StructFields
根元素可以像这样索引。
IN: val temp = df.schema
IN: temp(0)
OUT: StructField(A,StringType,true)
IN: temp(3)
OUT: StructField(D,StructType(StructField(D1,StructType(StructField(D11,StringType,true), StructField(D12,StringType,true), StructField(D13,StringType,true)),true), StructField(D2,StringType,true), StructField(D3,StringType,true)),true)
Run Code Online (Sandbox Code Playgroud)
当我尝试访问嵌套时StructType
,会发生以下情况
IN: val temp1 = temp(3).dataType
IN: temp1(0)
OUT:
Name: Unknown Error
Message: <console>:38: error: org.apache.spark.sql.types.DataType does not take parameters
temp1(0)
^
StackTrace:
Run Code Online (Sandbox Code Playgroud)
我不明白的是, 和temp
都是temp1
该类的StructType
,但是temp
是可迭代的,但temp1
不是。
IN: temp.getClass
OUT: class org.apache.spark.sql.types.StructType
IN: temp1.getClass
OUT: class org.apache.spark.sql.types.StructType
Run Code Online (Sandbox Code Playgroud)
我也尝试过dtypes
,但在尝试访问嵌套元素时最终遇到了类似的问题。
IN: df.dtypes(3)(0)
OUT:
Name: Unknown Error
Message: <console>:36: error: (String, String) does not take parameters
df.dtypes(3)(0)
^
StackTrace:
Run Code Online (Sandbox Code Playgroud)
那么,在了解子字段之前如何遍历模式呢?
好吧,如果您想要所有嵌套列的列表,您可以编写这样的递归函数
鉴于:
val schema = StructType(
StructField("name", StringType) ::
StructField("nameSecond", StringType) ::
StructField("nameDouble", StringType) ::
StructField("someStruct", StructType(
StructField("insideS", StringType) ::
StructField("insideD", StructType(
StructField("inside1", StringType) :: Nil
)) ::
Nil
)) ::
Nil
)
val rdd = session.sparkContext.emptyRDD[Row]
val df = session.createDataFrame(rdd, schema)
df.printSchema()
Run Code Online (Sandbox Code Playgroud)
这将产生:
root
|-- name: string (nullable = true)
|-- nameSecond: string (nullable = true)
|-- nameDouble: string (nullable = true)
|-- someStruct: struct (nullable = true)
| |-- insideS: string (nullable = true)
| |-- insideD: struct (nullable = true)
| | |-- inside1: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
如果您想要列的全名列表,您可以编写如下内容:
def fullFlattenSchema(schema: StructType): Seq[String] = {
def helper(schema: StructType, prefix: String): Seq[String] = {
val fullName: String => String = name => if (prefix.isEmpty) name else s"$prefix.$name"
schema.fields.flatMap {
case StructField(name, inner: StructType, _, _) =>
fullName(name) +: helper(inner, fullName(name))
case StructField(name, _, _, _) => Seq(fullName(name))
}
}
helper(schema, "")
}
Run Code Online (Sandbox Code Playgroud)
将返回:
ArraySeq(name, nameSecond, nameDouble, someStruct, someStruct.insideS, someStruct.insideD, someStruct.insideD.inside1)
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
7586 次 |
最近记录: |