I have a dataframe with the following structure:
|-- data: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
| |-- details: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
How can I flatten the struct and create a new dataframe:
|-- id: long (nullable = true)
|-- keyNote: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- note: …

I want to perform something similar to pandas.io.json.json_normalize on a PySpark dataframe. Is there an equivalent function in Spark?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
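Spark has no direct json_normalize equivalent; the usual approach is to select nested struct fields by dotted path (e.g. df.select("data.id", "data.keyNote.key")). As a sketch of the naming scheme only, here is the per-record logic in plain Python, using a hypothetical flatten helper on dicts rather than the Spark API (note that in Spark you would normally leave map columns such as details alone, whereas a plain dict cannot tell structs and maps apart):

```python
def flatten(record, prefix=""):
    # Recursively hoist nested dict fields up to dotted top-level keys.
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f"{name}."))
        else:
            out[name] = value
    return out

row = {"id": 1, "keyNote": {"key": "k1", "note": "n1"}, "details": {"a": "b"}}
flat = flatten(row)
# e.g. the nested keyNote.key field becomes a top-level "keyNote.key" entry
```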
I am using the approach given here to flatten a DataFrame in Spark SQL. Here is my code:
package com.acme.etl.xml

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}

object RuntimeError {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
    val rowTag = "idocData"
    val dataFrameReader = spark.read.option("rowTag", rowTag)
    val xmlUri = "bad_011_1.xml"
    val df = dataFrameReader.format("xml").load(xmlUri)
    val schema: StructType = df.schema
    val columns: Array[Column] = flattenSchema(schema)
    val df2 = df.select(columns: _*)
  }

  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
    schema.fields.flatMap(f => {
      val colName: String = if (prefix …

The error I get in PySpark:
pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"
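The AnalysisException states the problem directly: trackers is an array, so trackers['token'] is an array index lookup whose argument must be an integer; a string field name cannot index an array. The field has to be projected across the array's struct elements instead (in Spark, e.g. a dotted path over the array, or a higher-order function such as transform). The distinction, sketched with plain Python lists and dicts (the sample data is made up):

```python
# A struct maps field name -> value, like a dict; an array is positional, like a list.
date = {"date": "2020-01-01",
        "trackers": [{"token": "t1"}, {"token": "t2"}]}

# date["trackers"]["token"] would raise TypeError (list indices must be integers),
# which is the same mismatch the AnalysisException reports.
# Projecting the field across the elements is what the query needs to express:
tokens = [tracker["token"] for tracker in date["trackers"]]
```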
The data structure:
-- result_set: struct (nullable = true)
| |-- currency: string (nullable = true)
| |-- dates: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- date: string (nullable = true)
| | | |-- trackers: array (nullable = true)
| | | | …

I would like to collect the names of all fields in a nested schema. The data was imported from a JSON file.
The schema looks like this:
root
|-- column_a: string (nullable = true)
|-- column_b: string (nullable = true)
|-- column_c: struct (nullable = true)
| |-- nested_a: struct (nullable = true)
| | |-- double_nested_a: string (nullable = true)
| | |-- double_nested_b: string (nullable = true)
| | |-- double_nested_c: string (nullable = true)
| |-- nested_b: string (nullable = true)
|-- column_d: string (nullable = true)
If I use df.schema.fields or df.schema.names, it only prints the names of the top-level columns, not the nested ones.
The desired output is a Python list containing all the column names, e.g.:
['column_a', 'column_b', 'column_c.nested_a.double_nested_a', …

I have a Spark job with a DataFrame containing the following values:
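For the field-name question: recursing through the schema and emitting a dotted path at every leaf produces exactly such a list. With pyspark one would walk df.schema.fields and recurse whenever a field's dataType is a StructType; here the same traversal is sketched over a plain-dict stand-in for the schema (the dict layout is invented for illustration, it is not the pyspark API):

```python
def leaf_names(schema, prefix=""):
    # Leaves are type strings; nested structs are dicts we recurse into.
    names = []
    for name, dtype in schema.items():
        path = f"{prefix}{name}"
        if isinstance(dtype, dict):
            names.extend(leaf_names(dtype, prefix=f"{path}."))
        else:
            names.append(path)
    return names

schema = {
    "column_a": "string",
    "column_b": "string",
    "column_c": {
        "nested_a": {
            "double_nested_a": "string",
            "double_nested_b": "string",
            "double_nested_c": "string",
        },
        "nested_b": "string",
    },
    "column_d": "string",
}
names = leaf_names(schema)
```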
{
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {
        "type": {
            "isMale": true,
            "id": "dd",
            "mcc": 1234,
            "name": "Adam"
        }
    }
}
{
    "id": "abc",
    "test_id": "asf",
    "props": {
        "type2": {
            "isMale": true,
            "id": "dd",
            "mcc": 12134,
            "name": "Perth"
        }
    }
}
I want to flatten it elegantly (the keys and types inside are not known in advance), so that props remains a struct but everything inside it is flattened, regardless of the level of nesting.
The desired output is:
{
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {
        "type.isMale": true,
        "type.id": "dd",
        "type.mcc": 1234,
        "type.name": "Adam"
    }
}
{
    "id": "abc",
    "test_id": "asf",
    "props": {
        "type2.isMale": true,
        "type2.id": "dd",
        "type2.mcc": 12134,
        "type2.name": "Perth" …
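In record terms, the transformation wanted here collapses everything below props to one level of dotted keys while leaving id, test_id, and the props container itself untouched. A plain-Python sketch of that per-record reshaping (dict-based, not the DataFrame API; in Spark the equivalent would rebuild props as a struct of the flattened leaf columns):

```python
def flatten_props(props):
    # Iteratively collapse arbitrarily deep nesting into dotted keys.
    flat, stack = {}, [("", props)]
    while stack:
        prefix, value = stack.pop()
        for key, inner in value.items():
            path = f"{prefix}{key}"
            if isinstance(inner, dict):
                stack.append((f"{path}.", inner))
            else:
                flat[path] = inner
    return flat

record = {
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {"type": {"isMale": True, "id": "dd", "mcc": 1234, "name": "Adam"}},
}
record["props"] = flatten_props(record["props"])
```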