I have a dataframe with 3 columns:
| str1   | array_of_str1 | array_of_str2 |
+--------+---------------+---------------+
| John   | [Size, Color] | [M, Black]    |
| Tom    | [Size, Color] | [L, White]    |
| Matteo | [Size, Color] | [M, Red]      |
I want to add an array column that combines the 3 columns into structs:
| str1   | array_of_str1 | array_of_str2 | concat_result                               |
+--------+---------------+---------------+---------------------------------------------+
| John   | [Size, Color] | [M, Black]    | [[John, Size, M], [John, Color, Black]]     |
| Tom    | [Size, Color] | [L, White]    | [[Tom, Size, L], [Tom, Color, White]]       |
| Matteo | [Size, Color] | [M, Red]      | [[Matteo, Size, M], [Matteo, Color, Red]]   |
If the number of elements in the arrays is fixed, this is quite simple using the array and struct functions. Here is some code in Scala:
val result = df
.withColumn("concat_result", array((0 to 1).map(i => struct(
col("str1"),
col("array_of_str1").getItem(i),
col("array_of_str2").getItem(i)
)) : _*))
And in Python, since you asked about pyspark:
import pyspark.sql.functions as F
df.withColumn("concat_result", F.array(*[ F.struct(
F.col("str1"),
F.col("array_of_str1").getItem(i),
F.col("array_of_str2").getItem(i))
for i in range(2)]))
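To see what this builds per row without a Spark cluster, here is a plain-Python sketch of the same logic (the `concat_result` helper is hypothetical, just for illustration): `zip` pairs up the i-th elements of the two arrays, mirroring the `getItem(i)` calls, and `str1` is prepended to each pair.

```python
def concat_result(str1, array_of_str1, array_of_str2):
    # Mirrors array(struct(str1, array_of_str1[i], array_of_str2[i]) for each i):
    # pair the i-th attribute name with the i-th value, prepending str1.
    return [[str1, a, b] for a, b in zip(array_of_str1, array_of_str2)]

row = concat_result("John", ["Size", "Color"], ["M", "Black"])
print(row)  # [['John', 'Size', 'M'], ['John', 'Color', 'Black']]
```

Note that `range(2)` in the pyspark version hard-codes the array length, which is why this approach only works when the number of elements is fixed.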
You will get the following schema:
root
|-- str1: string (nullable = true)
|-- array_of_str1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- array_of_str2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- concat_result: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- str1: string (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: string (nullable = true)