Difference between === null and isNull in a Spark DataFrame

Joh*_*ohn 18 sql scala dataframe apache-spark apache-spark-sql

I am confused about the difference between the following two expressions:

 df.filter(col("c1") === null) and df.filter(col("c1").isNull) 

For the same DataFrame I get a count with === null but a count of zero with isNull. Please help me understand the difference. Thanks.

use*_*411 34

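First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question, this is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks an undefined value, the result is undefined for any value, including NULL itself.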

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
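
The outputs above follow SQL's three-valued logic. As a rough model only (plain Scala, not Spark's actual implementation), NULL can be represented as None and a comparison as an Option[Boolean]. The names Sql, sqlEq and keep are hypothetical and exist only for this illustration.

```scala
object ThreeValuedLogicSketch {
  // None stands in for SQL NULL; Some(v) is a known value.
  type Sql[A] = Option[A]

  // Models SQL `=`: the result is unknown (None) as soon as either side is NULL.
  def sqlEq[A](a: Sql[A], b: Sql[A]): Sql[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Models WHERE / filter: a row is kept only when the predicate is definitely true.
  def keep(predicate: Sql[Boolean]): Boolean = predicate.contains(true)

  def main(args: Array[String]): Unit = {
    // Hypothetical column c1 with one value and two NULLs.
    val c1: Seq[Sql[String]] = Seq(Some("a"), None, None)

    println(c1.count(v => keep(sqlEq(v, None)))) // 0: comparing with NULL is never true
    println(c1.count(_.isEmpty))                 // 2: an explicit emptiness check finds the NULLs
  }
}
```

In this model a filter on "equals NULL" can never keep a row, because the comparison never yields a definite true.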

The only valid ways to check for NULL are IS NULL and IS NOT NULL, which are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
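
A minimal sketch of how these predicates behave in a filter, assuming a local SparkSession and a made-up three-row table (the object name, the sample data and the local[1] master are assumptions for illustration, not part of the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullFilterSketch {
  // Hypothetical sample data: one non-null value and two NULLs in column c1.
  def sample(spark: SparkSession): DataFrame =
    spark.sql("SELECT * FROM VALUES ('a'), (NULL), (NULL) AS t(c1)")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-filter-sketch")
      .getOrCreate()
    val df = sample(spark)

    // c1 = NULL evaluates to NULL for every row, and filter keeps only rows
    // where the predicate is true, so nothing is selected.
    println(df.filter(col("c1") === null).count())  // 0
    // isNull / isNotNull always return true or false, never NULL.
    println(df.filter(col("c1").isNull).count())    // 2
    println(df.filter(col("c1").isNotNull).count()) // 1

    spark.stop()
  }
}
```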

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and the DataFrame DSL respectively.
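
To see the effect on row selection rather than on a single literal, the following sketch compares === with <=> over a small made-up table. The object name, the sample rows and the local[1] master are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

object NullSafeEqualitySketch {
  // Hypothetical rows: (a, a), (a, NULL), (NULL, NULL).
  def sample(spark: SparkSession): DataFrame =
    spark.sql(
      "SELECT * FROM VALUES ('a', 'a'), ('a', NULL), (NULL, NULL) AS t(col1, col2)")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-safe-equality-sketch")
      .getOrCreate()
    val df = sample(spark)

    // Plain equality is NULL whenever either side is NULL, so only (a, a) matches.
    println(df.filter(col("col1") === col("col2")).count())      // 1
    // Null-safe equality treats two NULLs as equal: (a, a) and (NULL, NULL) match.
    println(df.filter(col("col1") <=> col("col2")).count())      // 2
    // Its negation corresponds to IS DISTINCT FROM: only (a, NULL) matches.
    println(df.filter(not(col("col1") <=> col("col2"))).count()) // 1

    spark.stop()
  }
}
```

The same <=> operator can be used as a join condition when rows with NULL keys should match each other.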

Related:

Including null values in an Apache Spark Join