PySpark: convert an array of structs to a string

Question

Fra*_*sYL 3 python apache-spark-sql pyspark

I have the following DataFrame in PySpark:

+----+-------+-----+                                                            
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+

I use collect_list and struct from pyspark.sql.functions:

from pyspark.sql.functions import collect_list, struct
df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following DataFrame:

+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+

My question is how to convert the last column, score_list, into a string and dump it to a CSV file that looks like this:

Tom     (math, 90) | (physics, 70)
Amy     (math, 95)

Any help is appreciated, thanks.

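Update: there is a similar question, but it is not exactly the same, because it goes directly from one string to another string. In my case, I want to first turn the string columns into a collect_list<struct> and then stringify that collect_list<struct>.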

Answer 1

jxc*_*jxc 5

Based on your update and comments, for Spark 2.4.0+, here is one way to stringify an array of structs using the Spark SQL built-in functions transform and array_join:

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- score: integer (nullable = true)

>>> df.show(2,0)
+----+---------------------------+
|name|score_list                 |
+----+---------------------------+
|Tom |[[math, 90], [physics, 70]]|
|Amy |[[math, 95]]               |
+----+---------------------------+

>>> df.selectExpr(
        "name"
      , """
         array_join(
             transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
           , ' | '
         ) AS score_list
        """
).show(2,0)

+----+--------------------------+
|name|score_list                |
+----+--------------------------+
|Tom |(math, 90) | (physics, 70)|
|Amy |(math, 95)                |
+----+--------------------------+

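Where:

Use transform() to convert the array of structs into an array of strings. For each array element (the struct x), concat('(', x.subject, ', ', x.score, ')') converts it into a string.
Use array_join() to join all array elements (StringType) with the separator ' | ', which returns the final string.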
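
The answer stops at show(), while the question also asks for a CSV file. The following is a minimal sketch of the full path from the original DataFrame to a file, not the answerer's code. The helper names stringify_structs_expr and write_scores_csv are hypothetical, and the tab separator is an assumption, since the question does not say which delimiter it wants. The helper builds the same transform/array_join expression shown above.

```python
def stringify_structs_expr(column, fields, sep=" | "):
    """Build a Spark SQL expression that renders an array<struct> column as one string.

    Hypothetical helper: it only assembles the expression text used in the answer.
    """
    # Interleave the struct fields with ', ' literals, e.g. x.subject, ', ', x.score
    parts = ", ', ', ".join(f"x.{name}" for name in fields)
    return (
        f"array_join(transform({column}, x -> concat('(', {parts}, ')')), '{sep}')"
        f" AS {column}"
    )


def write_scores_csv(df, path):
    """Group df by name, stringify the struct array, and write it out as CSV.

    Assumes df has the columns name, subject and score from the question,
    and Spark 2.4.0+ (needed for transform and array_join).
    """
    # Imported lazily so the expression helper can be used without PySpark installed.
    from pyspark.sql.functions import collect_list, struct

    grouped = df.groupBy("name").agg(
        collect_list(struct("subject", "score")).alias("score_list")
    )
    result = grouped.selectExpr(
        "name", stringify_structs_expr("score_list", ["subject", "score"])
    )
    # Assumption: tab-separated output. Spark writes a directory of part files at path.
    result.write.csv(path, sep="\t", mode="overwrite")
```

Calling write_scores_csv(df, "scores_out") on the question's original three-column DataFrame should produce one line per name. Because the stringified column is a plain StringType, the CSV writer accepts it, whereas it cannot write an array<struct> column directly.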

Archived:	6 years, 6 months ago
Views:	1961
Last activity:	6 years, 3 months ago