pyspark - merge 2 columns of sets

sou*_*ess 2 apache-spark pyspark pyspark-sql

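I have a Spark DataFrame with 2 columns that were produced by the collect_set function. I would like to merge these 2 set columns into a single set column. How should I go about it? Both are sets of strings.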

For instance, I have 2 columns formed by calling collect_set:

Fruits                  |    Meat
[Apple,Orange,Pear]          [Beef, Chicken, Pork]

How do I turn this into:

Food

[Apple,Orange,Pear, Beef, Chicken, Pork]

Thanks a lot in advance for your help.

Cze*_*ogy 5

I was figuring this out in Python as well, so here is a Python port of Ramesh's solution:

from pyspark.sql.functions import col, udf

# Assumes an existing SparkSession named `spark`
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1,False)

# Concatenate the two array columns row by row
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Thanks, Ramesh!


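EDIT: Note that you may have to specify the column type manually (I am not sure why it worked for me without an explicit type specification only in some cases; in other cases I got a string-type column).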
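For what it's worth, `udf` falls back to a `StringType` return value when no return type is passed, which would explain the string-type column. Here is a minimal sketch of the UDF with an explicit `ArrayType(StringType())` return type. The helper names `merge_lists` and `with_food_column` are made up for illustration, and the `None` handling is an extra assumption on my part, not something the question asked for:

```python
def merge_lists(fruits, meat):
    """Concatenate two lists, treating a null (None) column value as empty."""
    return (fruits or []) + (meat or [])


def with_food_column(df):
    """Return df with an array<string> "Food" column built from Fruits and Meat."""
    # pyspark is imported here so merge_lists can be tried without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    # Without the second argument, udf() defaults to a StringType return value.
    merge_cols = udf(merge_lists, ArrayType(StringType()))
    return df.withColumn("Food", merge_cols(col("Fruits"), col("Meat")))
```

With the `df` from above, `with_food_column(df).printSchema()` should then report `Food` as an array of strings.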



Pre*_*rem 0

Suppose df is:

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

Then combining Fruits and Meat into a single collection gives:

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
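
One way to get a result of this shape is to map over the DataFrame's underlying RDD and chain the two lists of each row. This is only a minimal sketch and not necessarily the approach meant above; `combine_row` and `collect_combined` are hypothetical names, and it assumes `df.rdd` is available in your environment:

```python
from itertools import chain


def combine_row(row):
    """Chain the Fruits and Meat lists of one row into a single list."""
    return list(chain(row["Fruits"], row["Meat"]))


def collect_combined(df):
    """Collect one combined list per row to the driver (a list of lists)."""
    return df.rdd.map(combine_row).collect()
```

Keep in mind that `collect()` brings every row back to the driver, so this suits small results; for a new column on a large DataFrame the UDF approach stays on the cluster.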


Hope this helps!