Ran*_*nty 2 dataframe avro apache-spark apache-spark-sql spark-dataframe
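I have a DataFrame with a nested structure (originally the Avro output of a mapreduce job). I would like to flatten it. The schema of the original DataFrame looks like this (simplified):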
|-- key: struct
| |-- outcome: boolean
| |-- date: string
| |-- age: int
| |-- features: map
| | |-- key: string
| | |-- value: double
|-- value: struct (nullable = true)
| |-- nullString: string (nullable = true)
In a JSON representation, one row of the data looks like this:
{"key":
{"outcome": false,
"date": "2015-01-01",
"age" : 20,
"features": {
"f1": 10.0,
"f2": 11.0,
...
"f100": 20.1
}
},
"value": null
}
The features map has the same structure for every row, i.e. the key set is always the same (f1, f2, ..., f100). By "flatten" I mean the following.
+----------+----------+---+----+----+-...-+------+
| outcome| date|age| f1| f2| ... | f100|
+----------+----------+---+----+----+-...-+------+
| true|2015-01-01| 20|10.0|11.0| ... | 20.1|
...
(truncated)
I am using Spark 2.1.0 with the spark-avro package from https://github.com/databricks/spark-avro.
The original DataFrame is read in by
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
| key| value|
+--------------------+------+
|[false,2015... |[null]|
|[false,2015... |[null]|
...
(truncated)
Any help is greatly appreciated!
In Spark, you can extract data from a nested Avro file. For example, given the JSON you provided:
{"key":
{"outcome": false,
"date": "2015",
"features": {
"f1": v1,
"f2": v2,
...
}
},
"value": null
}
After reading it from Avro:
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
You can then produce flattened data from the nested structure. To do so, you can write code like this:
df.select("key.*").show
+----+------------+-------+
|date| features |outcome|
+----+------------+-------+
|2015| [v1,v2,...]| false|
+----+------------+-------+
...
(truncated)
df.select("key.*").printSchema
root
|-- date: string (nullable = true)
|-- features: struct (nullable = true)
| |-- f1: string (nullable = true)
| |-- f2: string (nullable = true)
| |-- ...
|-- outcome: boolean (nullable = true)
Or something like this:
df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---
...
(truncated)
df.select("key.features.*").printSchema
root
|-- f1: string (nullable = true)
|-- f2: string (nullable = true)
|-- ...
if that is the output you expect.
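One caveat: the schema in the question lists features as a map, not a struct, and star expansion such as key.features.* only applies to struct columns. For a map column, one possible approach is to look up each key with getItem and alias the result. Below is a minimal sketch under that assumption. The object name FlattenFeatures, the helper flatten and the two-feature sample row are made up for illustration, and the key names are discovered from the data with explode, which costs an extra pass over it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, struct}

// Hypothetical helper object, not part of the original answer.
object FlattenFeatures {

  // Flattens key.outcome, key.date, key.age and every entry of the
  // key.features map into top-level columns.
  def flatten(df: DataFrame): DataFrame = {
    // explode on a map column yields two columns named "key" and "value";
    // only the distinct map keys are collected to the driver.
    val featureNames: Seq[String] = df
      .select(explode(col("key.features")))
      .select("key")
      .distinct()
      .collect()
      .map(_.getString(0))
      .sorted // lexicographic order, so "f10" sorts before "f2"
      .toSeq

    // Scalar fields of the struct keep their own names when selected.
    val scalarCols = Seq("outcome", "date", "age").map(n => col(s"key.$n"))
    // One column per map key, aliased to the key name.
    val featureCols = featureNames.map(n => col("key.features").getItem(n).as(n))

    df.select(scalarCols ++ featureCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-in for the Avro data (assumed sample values).
    val df = Seq((false, "2015-01-01", 20, Map("f1" -> 10.0, "f2" -> 11.0)))
      .toDF("outcome", "date", "age", "features")
      .select(struct("outcome", "date", "age", "features").as("key"))

    val flat = flatten(df)
    flat.printSchema()
    flat.show()

    spark.stop()
  }
}
```

Since the key set is said to be the same for every row, the discovery step could be skipped by building the names directly, for example (1 to 100).map(i => s"f$i"), which also keeps the columns in numeric order.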