Zz'*_*Rot 3 python dataframe apache-spark apache-spark-sql pyspark
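I am trying to expand a DataFrame column that has a nested struct type (see below) into multiple columns. The struct schema I am working with looks like {"foo": 3, "bar": {"baz": 2}}.
Ideally, I would like to expand the above into two columns ("foo" and "bar.baz"). However, when I try .select("data.*") (where data is the struct column), I only get the columns foo and bar, and bar is still a struct.
Is there a way to expand both levels of the struct?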
Zz'*_*Rot 14
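I ended up going with the following function, which recursively "unwraps" the nested struct:
Essentially, it keeps digging into struct fields and leaves all other fields intact. This avoids a very long df.select(...) statement when the struct has many fields. Here is the code: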
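from pyspark.sql.functions import col
from pyspark.sql.types import StructType

# Takes a StructType schema object and returns a list of column selectors that flattens the struct
def flatten_struct(schema, prefix=""):
    result = []
    for elem in schema:
        if isinstance(elem.dataType, StructType):
            # Recurse into nested structs, extending the dotted prefix
            result += flatten_struct(elem.dataType, prefix + elem.name + ".")
        else:
            # Leaf field: select it and alias it with its full dotted path
            result.append(col(prefix + elem.name).alias(prefix + elem.name))
    return result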
from pyspark.sql import Row

# sc is an existing SparkContext
df = sc.parallelize([Row(r=Row(a=1, b=Row(foo="b", bar="12")))]).toDF()
df.show()
+----------+
|         r|
+----------+
|[1,[12,b]]|
+----------+
df_expanded = df.select("r.*")
df_flattened = df_expanded.select(flatten_struct(df_expanded.schema))
df_flattened.show()
+---+-----+-----+
|  a|b.bar|b.foo|
+---+-----+-----+
|  1|   12|    b|
+---+-----+-----+
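The recursion above can be illustrated without a running Spark session. The following is only a sketch, assuming the schema is given as a plain dict in the shape that `StructType.jsonValue()` returns; the names `leaf_paths` and `example_schema` are hypothetical and not part of the answer. Like the original function, it descends into structs only and treats everything else (including arrays) as a leaf.

```python
def leaf_paths(schema_json, prefix=""):
    """Return the dotted paths of all non-struct leaf fields in a schema dict."""
    paths = []
    for field in schema_json["fields"]:
        name = prefix + field["name"]
        field_type = field["type"]
        # Nested structs appear as dicts with "type": "struct"; simple types are strings.
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            paths += leaf_paths(field_type, name + ".")
        else:
            paths.append(name)
    return paths


# Hand-written stand-in (assumption) for df_expanded.schema.jsonValue()
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "a", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "b",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "bar", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "foo", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(leaf_paths(example_schema))  # ['a', 'b.bar', 'b.foo']
```

With a real DataFrame, the resulting paths could presumably be passed to a select such as `df_expanded.select([col(p).alias(p) for p in leaf_paths(df_expanded.schema.jsonValue())])`, which should match what `flatten_struct` produces.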
Psi*_*dom 11
You can select data.bar.baz and alias it as bar.baz:
df.show()
+-------+
|   data|
+-------+
|[3,[2]]|
+-------+
df.printSchema()
root
 |-- data: struct (nullable = false)
 |    |-- foo: long (nullable = true)
 |    |-- bar: struct (nullable = false)
 |    |    |-- baz: long (nullable = true)
In PySpark:
import pyspark.sql.functions as F
df.select(F.col("data.foo").alias("foo"), F.col("data.bar.baz").alias("bar.baz")).show()
+---+-------+
|foo|bar.baz|
+---+-------+
|  3|      2|
+---+-------+
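One caveat with an alias such as bar.baz: a column name that contains a dot has to be wrapped in backticks when it is referenced later (for example df.select("`bar.baz`")), because Spark otherwise reads the dot as struct field access. If that is inconvenient, the aliases can be built with a different separator. The sketch below is plain Python and only computes the alias names; the helper `flat_alias` and the choice of "_" as separator are assumptions for illustration, not part of the answer.

```python
def flat_alias(path, root, sep="_"):
    """Drop the leading root column from a dotted path and join the rest with sep."""
    prefix = root + "."
    relative = path[len(prefix):] if path.startswith(prefix) else path
    return relative.replace(".", sep)


# Field paths from the example schema in this answer
paths = ["data.foo", "data.bar.baz"]
aliases = {p: flat_alias(p, root="data") for p in paths}
print(aliases)  # {'data.foo': 'foo', 'data.bar.baz': 'bar_baz'}
```

In PySpark, the mapping could then be applied with something like `df.select([F.col(p).alias(a) for p, a in aliases.items()])`.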