San*_*jay · scala, apache-spark
How do I call the UDF below with multiple parameters (currying) in a Spark DataFrame, as shown here?
Read the file and get a List[String]
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
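For context, `Udfnc.getVal` is not shown in the question. A curried helper with the matching signature might look like the sketch below; the body is a hypothetical placeholder, and only the signature `(Int, String, String)(List[String])` comes from the question:

```scala
object Udfnc {
  // Hypothetical body -- only the curried signature appears in the
  // question; the concatenation here is a placeholder.
  def getVal(id: Int, s1: String, s2: String)(xs: List[String]): String =
    s"${id}_${s1}_${s2}_${xs.mkString("_")}"
}
```

Note that `udf(...)` applied to this eta-expanded curried method produces a UDF of four Column arguments, which is why the three-column call below does not line up.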
Call the UDF on the DataFrame below
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
The List[String] argument is the one I'm missing here, and I'm really not sure how I should pass it.
Based on your question, I can make the following assumptions about your requirement:
a] The UDF should accept parameters other than the dataframe columns
b] The UDF should take multiple columns as parameters
Assuming you want to concatenate the values from all the columns together with the specified parameter, here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String](
  (value1: Int, value2: String, value3: String) =>
    value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
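An alternative to closing over the list, assuming you are on Spark 2.2+ where `typedLit` is available: pass the list as a literal column, so the UDF keeps a single flat parameter list. Note that the collection arrives inside the UDF as a `Seq`, not a `List`:

```scala
import org.apache.spark.sql.functions.{udf, typedLit}

// Declare the list parameter as Seq[String]: Spark hands array columns
// to Scala UDFs as Seq, so a List[String] parameter may fail at runtime.
val concatUdf = udf((id: Int, s1: String, s2: String, xs: Seq[String]) =>
  s"${id}_${s1}_${s2}_${xs.mkString("_")}")

val result2 = df.withColumn("new_col",
  concatUdf(df("id"), df("str1"), df("str2"), typedLit(dummyList)))
```

This avoids building a fresh UDF for every list value, at the cost of embedding the list in the query plan as a literal.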