Get the distinct elements of an ArrayType column in a Spark DataFrame

Mas*_*oei 5 scala spark-dataframe

I have a DataFrame with 3 columns named Id, feat1, and feat2. feat1 and feat2 are arrays of strings:

Id, feat1,feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements in each feature column, so the output would be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

Psi*_*dom 6

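You can use collect_set to find the distinct values of each column after applying the explode function to that column to unnest the array elements in each cell. Suppose your DataFrame is called df: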

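import org.apache.spark.sql.functions._

// Unnest both array columns, then collect the distinct values of each.
// Note: explode drops rows whose array is empty or null;
// explode_outer (Spark 2.2+) keeps them as null instead.
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))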

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
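If some arrays are truly empty, as feat2 is for Id 1 in the question, explode removes those rows before the aggregation, which can also lose values from the other column. A possible alternative, sketched here assuming Spark 2.4 or later, aggregates each column without exploding. The local SparkSession and the sample data are assumptions for illustration; sort_array is only added to make the element order deterministic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, flatten, sort_array}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-array-elements")
  .getOrCreate()
import spark.implicits._

// Sample data mirroring the question (assumed); row 1 has an empty feat2 array.
val df = Seq(
  (1, Seq("feat1_1", "feat1_2", "feat1_3"), Seq.empty[String]),
  (2, Seq("feat1_2"), Seq("feat2_1", "feat2_2")),
  (3, Seq("feat1_4"), Seq("feat2_3"))
).toDF("Id", "feat1", "feat2")

// collect_list gathers every array into an array of arrays, flatten merges them
// into one array, array_distinct removes duplicates, sort_array orders the result.
val distinctDf = df.agg(
  sort_array(array_distinct(flatten(collect_list(col("feat1"))))).alias("distinct_feat1"),
  sort_array(array_distinct(flatten(collect_list(col("feat2"))))).alias("distinct_feat2")
)

val row = distinctDf.head()
val distinctFeat1 = row.getSeq[String](0)
val distinctFeat2 = row.getSeq[String](1)
```

Because each column is aggregated independently, an empty array in one column does not affect the other, and the two explodes no longer multiply the row count before aggregation.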