How do I intersect list columns in a PySpark DataFrame?

Question

I have the PySpark DataFrame below, and I need to create a new column (new_col) containing the items common to columns X and Y, excluding any items that appear in Z.

df

id X             Y                    Z            new_col
1 [12,23,1,24]  [13,412,12,23,24]     [12]         [23,24]
2 [1,2,3]       [2,4,5,6]             []           [2]

Answer 1

pau*_*ult 8

If your schema looks like this:

df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# |    |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# |    |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# |    |-- element: long (containsNull = true)

and you are on PySpark 2.4 or later, you can use array_intersect and array_except:

from pyspark.sql.functions import array_except, array_intersect

# Keep the items common to X and Y, then drop any that also appear in Z.
df = df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X              |Y                    |Z   |new_col |
#+---+---------------+---------------------+----+--------+
#|1  |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2  |[1, 2, 3]      |[2, 4, 5, 6]         |[]  |[2]     |
#+---+---------------+---------------------+----+--------+

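For readers who want to see the per-row logic spelled out, here is a minimal plain-Python sketch of what the combined expression computes. The function name common_items_excluding is hypothetical and not part of PySpark. Returning None when any input is null, and keeping items in the order they appear in X, are assumptions about how the built-in functions behave rather than something the answer states.

```python
def common_items_excluding(x, y, z):
    """Return the items of x that also occur in y but not in z, without duplicates."""
    # Assumption: a null (None) input array yields a null result.
    if x is None or y is None or z is None:
        return None
    in_y = set(y)
    excluded = set(z)
    seen = set()
    result = []
    # Walk x in order so the result keeps x's ordering.
    for item in x:
        if item in in_y and item not in excluded and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# The two rows from the question.
print(common_items_excluding([12, 23, 1, 24], [13, 412, 12, 23, 24], [12]))  # [23, 24]
print(common_items_excluding([1, 2, 3], [2, 4, 5, 6], []))  # [2]
```

On a Spark version older than 2.4, a function like this could in principle be wrapped with pyspark.sql.functions.udf and a return type of ArrayType(LongType()), then applied to the X, Y and Z columns. That would likely be slower than the built-in functions, so the array_intersect and array_except approach is preferable whenever it is available.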
