How to concatenate two ArrayType(StringType()) columns element-wise in PySpark?

Question

ARC*_*row 1 apache-spark apache-spark-sql pyspark

I have two ArrayType(StringType()) columns in a Spark DataFrame, and I want to concatenate them element-wise:

Input:

+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|['a','b']    |['c','d']    |
|['a','b','c']|['e','f','g']|
+-------------+-------------+

Output:

+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|['a','b']    |['c','d']    |['ac', 'bd']    |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+-------------+----------------+

Thanks.

Answer 1

bla*_*hop 8

For Spark 2.4+, you can use the zip_with function:

zip_with(left, right, func) - Merges the two given arrays, element-wise, into a single array using function.

from pyspark.sql.functions import expr

df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))")).show()

#+------+------+--------+
#|  col1|  col2|    col3|
#+------+------+--------+
#|[a, b]|[c, d]|[ac, bd]|
#+------+------+--------+
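
One detail worth knowing when the two arrays can differ in length: per the Spark SQL documentation, zip_with pads the shorter array with nulls before applying the function, and concat returns null when any argument is null. The snippet below is a plain-Python model of that behavior, not Spark code. The names `zip_with_model` and `concat_model` are hypothetical and used only for illustration.

```python
from itertools import zip_longest


def concat_model(x, y):
    # Models SQL concat: the result is null (None) if either input is null.
    if x is None or y is None:
        return None
    return x + y


def zip_with_model(left, right, func):
    # Models zip_with: the shorter array is padded with nulls (None) up to
    # the length of the longer one, then func is applied pairwise.
    return [func(x, y) for x, y in zip_longest(left, right, fillvalue=None)]


print(zip_with_model(["a", "b"], ["c", "d"], concat_model))       # ['ac', 'bd']
print(zip_with_model(["a", "b", "c"], ["e", "f"], concat_model))  # ['ae', 'bf', None]
```

If both columns always have the same length per row, as in the question, the padding never comes into play.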

Another way is to use the transform function like this:

df.withColumn("col3", expr("transform(col1, (x, i) -> concat(x, col2[i]))"))

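The transform function takes the first array column, col1, as its argument, iterates over its elements, and applies the lambda function (x, i) -> concat(x, col2[i]), where x is the current element and i is its index, which is used to fetch the corresponding element from the array col2.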
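
To make the index-based lookup concrete, here is a plain-Python model of the transform variant. It is not Spark code, and `transform_model` and `element_at_index` are hypothetical helper names. It assumes ANSI mode is disabled (spark.sql.ansi.enabled=false), in which case an out-of-range lookup such as col2[i] yields null rather than raising an error.

```python
def element_at_index(arr, i):
    # Models col2[i] with ANSI mode disabled: out-of-range index -> null (None).
    return arr[i] if 0 <= i < len(arr) else None


def transform_model(col1, col2):
    # Models transform(col1, (x, i) -> concat(x, col2[i])):
    # iterate over col1 with a 0-based index and look up the matching col2 element.
    result = []
    for i, x in enumerate(col1):
        y = element_at_index(col2, i)
        # concat returns null when any argument is null
        result.append(None if x is None or y is None else x + y)
    return result


print(transform_model(["a", "b", "c"], ["e", "f", "g"]))  # ['ae', 'bf', 'cg']
print(transform_model(["a", "b", "c"], ["e"]))            # ['ae', None, None]
print(transform_model(["a"], ["e", "f"]))                 # ['ae']
```

Under this model the result always has the length of col1, so extra elements in col2 are dropped. zip_with, by contrast, produces an array as long as the longer input.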
