
Passing a nullable column as a parameter to a Spark SQL UDF

Here is a Spark UDF that I use to compute a value from a few columns.

def spark_udf_func(s: String, i:Int): Boolean = { 
    // I'm returning true regardless of the parameters passed to it.
    true
}

val spark_udf = org.apache.spark.sql.functions.udf(spark_udf_func _)

val df = sc.parallelize(Array[(Option[String], Option[Int])](
  (Some("Rafferty"), Some(31)),
  (None, Some(33)),
  (Some("Heisenberg"), Some(33)),
  (Some("Williams"), None)
)).toDF("LastName", "DepartmentID")

df.withColumn("valid", spark_udf(df.col("LastName"), df.col("DepartmentID"))).show()
+----------+------------+-----+
|  LastName|DepartmentID|valid|
+----------+------------+-----+
|  Rafferty|          31| true|
|      null|          33| true|
|Heisenberg|          33| true|
|  Williams|        null| null|
+----------+------------+-----+

Can anyone explain why the valid column is null in the last row?

When I inspected the Spark plan, I found that it contains a conditional expression: if column 2 (DepartmentID) is null, the result must be null.

== Physical Plan ==

*Project [_1#699 AS LastName#702, _2#700 AS DepartmentID#703, if (isnull(_2#700)) …
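
One way to probe the null check seen in the plan: Spark's documentation for `udf` notes that when a parameter has a primitive type such as `Int`, the UDF returns null for a null input, and suggests a boxed type or `Option` for manual null handling. The sketch below assumes that this is what happens here and swaps `Int` for `java.lang.Integer`. The names `NullableUdfSketch`, `nullTolerantFunc`, and `nullTolerantUdf` are hypothetical, and the standalone local `SparkSession` is an assumption (in `spark-shell` the existing `spark` session would be used instead).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullableUdfSketch {
  // Hypothetical variant of spark_udf_func: java.lang.Integer is a boxed
  // type and can hold null, unlike the primitive Int in the original.
  def nullTolerantFunc(s: String, i: java.lang.Integer): Boolean = {
    // Still returns true regardless of the arguments, as in the original.
    true
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session created only for this experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val nullTolerantUdf = udf(nullTolerantFunc _)

    // Same data as in the question, with None marking the missing values.
    val df = Seq[(Option[String], Option[Int])](
      (Some("Rafferty"), Some(31)),
      (None, Some(33)),
      (Some("Heisenberg"), Some(33)),
      (Some("Williams"), None)
    ).toDF("LastName", "DepartmentID")

    val result = df.withColumn("valid", nullTolerantUdf($"LastName", $"DepartmentID"))
    result.explain() // check whether the isnull(...) guard is still in the plan
    result.show()

    spark.stop()
  }
}
```

If the guard disappears from the plan with the boxed parameter, that would support the idea that the null in the last row comes from the primitive `Int` parameter rather than from the body of the function.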

apache-spark apache-spark-sql
