nuw*_*uwa 4 scala apache-spark apache-spark-sql
I have the following situation: I have a dataframe whose column is an array. For each array I want to generate all pairs of its elements and save them in a new dataframe. So for example:
This is the original dataframe:
+---------------+
| candidateList|
+---------------+
| [1, 2]|
| [2, 3, 4]|
| [1, 3, 5]|
|[1, 2, 3, 4, 5]|
|[1, 2, 3, 4, 5]|
+---------------+
And this is how it should look after the computation:
+---------------+
|     candidates|
+---------------+
| [1, 2]|
| [2, 3]|
| [2, 4]|
| [3, 4]|
| [1, 3]|
| [1, 5]|
| [3, 5]|
|and so on... |
+---------------+
I really don't know how to do this in Spark; maybe someone has a tip for me.
Kind regards
You need to create a UDF (user-defined function) and use it together with the explode
function. Thanks to the combinations
method on Scala collections, the UDF itself is simple:
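(To see what combinations does on its own, here is a quick plain-Scala sketch, independent of Spark:

```scala
// combinations(2) lazily yields every 2-element subsequence,
// in the order the elements appear in the source collection
val pairs = Seq(1, 2, 3, 4).combinations(2).toList
println(pairs)
// List(List(1, 2), List(1, 3), List(1, 4), List(2, 3), List(2, 4), List(3, 4))
```

The UDF below simply wraps this call.)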
import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._

// UDF returning all 2-element combinations of the input array
val pairsUdf = udf((arr: mutable.Seq[Int]) => arr.combinations(2).toArray)

// explode turns each array of pairs into one row per pair
val result = df.select(explode(pairsUdf($"candidateList")) as "candidates")
result.show(numRows = 8)
// +----------+
// |candidates|
// +----------+
// | [1, 2]|
// | [2, 3]|
// | [2, 4]|
// | [3, 4]|
// | [1, 3]|
// | [1, 5]|
// | [3, 5]|
// | [1, 2]|
// +----------+
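Note that the last two input rows are identical, so the exploded result contains repeated pairs (e.g. [1, 2] appears more than once); if you only want unique pairs, calling .distinct on the result DataFrame removes the duplicates. The flatten-then-dedupe logic can be sketched on plain Scala collections, where flatMap plays the role of explode:

```scala
// Plain-collection analogue of the DataFrame pipeline above
// (hypothetical sample input, with a duplicate row on purpose)
val input = List(List(1, 2), List(2, 3, 4), List(1, 2))

// flatMap ~ explode: one output element per generated pair
val exploded = input.flatMap(_.combinations(2))
// List(List(1, 2), List(2, 3), List(2, 4), List(3, 4), List(1, 2))

// drop repeated pairs, like result.distinct on the DataFrame
val unique = exploded.distinct
```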