San*_*jay · scala, apache-spark
How do I call the UDF below with multiple parameters (currying) in a Spark DataFrame, as shown here?
Read the file and get a List[String]
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
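For context, `Udfnc.getVal` is not shown in the question. A curried helper with the matching signature might look like the sketch below; the body is a hypothetical placeholder, and only the signature `(Int, String, String)(List[String])` comes from the question:

```scala
object Udfnc {
  // Hypothetical body -- only the curried signature appears in the
  // question; the concatenation here is a placeholder.
  def getVal(id: Int, s1: String, s2: String)(xs: List[String]): String =
    s"${id}_${s1}_${s2}_${xs.mkString("_")}"
}
```

Note that `udf(...)` applied to this eta-expanded curried method produces a UDF of four Column arguments, which is why the three-column call below does not line up.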
Call the UDF on the DataFrame below
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
The List[String] argument is the one I'm missing here, and I'm really not sure how I should pass it.
Based on your question, I can make the following assumptions about your requirement:
a] The UDF should accept parameters other than the dataframe columns
b] The UDF should take multiple columns as parameters
Assuming you want to concatenate the values from all the columns together with the specified parameter, here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String](
  (value1: Int, value2: String, value3: String) =>
    value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
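An alternative to closing over the list, assuming you are on Spark 2.2+ where `typedLit` is available: pass the list as a literal column, so the UDF keeps a single flat parameter list. Note that the collection arrives inside the UDF as a `Seq`, not a `List`:

```scala
import org.apache.spark.sql.functions.{udf, typedLit}

// Declare the list parameter as Seq[String]: Spark hands array columns
// to Scala UDFs as Seq, so a List[String] parameter may fail at runtime.
val concatUdf = udf((id: Int, s1: String, s2: String, xs: Seq[String]) =>
  s"${id}_${s1}_${s2}_${xs.mkString("_")}")

val result2 = df.withColumn("new_col",
  concatUdf(df("id"), df("str1"), df("str2"), typedLit(dummyList)))
```

This avoids building a fresh UDF for every list value, at the cost of embedding the list in the query plan as a literal.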