In Spark SQL, how do I select a subset of fields from a nested struct and keep them as a nested struct in the result, using a SQL statement?

Ale*_*ida 1 apache-spark-sql pyspark

I can execute the following statement in Spark SQL:

result_df = spark.sql("""select
    one_field,
    field_with_struct
  from purchases""")

The resulting DataFrame has the field field_with_struct with its complete struct:

one_field  field_with_struct
123        {name1, val1, val2, f2, f4}
555        {name2, val3, val4, f6, f7}

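I want to select only a few fields from field_with_struct, but keep them inside a struct in the resulting DataFrame. Something like this, if it were possible (this is not real code):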

result_df = spark.sql("""select
    one_field,
    struct(
      field_with_struct.name,
      field_with_struct.value2
    ) as my_subset
  from purchases""")

To get this:

one_field  my_subset
123        {name1, val2}
555        {name2, val4}

Is there a way to do this with SQL? (Not with the fluent API.)

Mik*_*eGM 6

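For an array of structs there is a simpler solution using arrays_zip, with no need for explode/collect_list (which can be error-prone or difficult with complex data, because it relies on things like an id column):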

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> # `spark` is the SparkSession that the pyspark shell predefines
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs      |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+

Edited to add the corresponding Spark SQL code, since that is what the OP asked for:

>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+