Setting invalid data as missing data in Spark DataFrames

Question


Bor*_*ris 2 scala user-defined-functions missing-data dataframe apache-spark

Let x be a DataFrame with two string columns, defined (in Scala) as follows:

case class Pair(X: String, Y: String)

val x = sqlContext.createDataFrame(Seq(
   Pair("u1", "1"), 
   Pair("u2", "wrong value"), 
   Pair("u3", "5"), 
   Pair("u4", "2")
))


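I would like to clean this DataFrame so that each value in the second column is either

converted to an Int, if possible, or
replaced by null, NA, or whatever symbol denotes a "missing value" (not NaN, which is something different).

I was thinking of using a udf function: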

val stringToInt = udf[Int, String](x => try {
     x.toInt
   } catch {
     case e: Exception => null
   })

x.withColumn("Y", stringToInt(x("Y")))


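...but null is not an Int (the return type declared for the UDF), so the compiler rejects it. Is there a way around this? A completely different approach would also be fine, as long as I can clean my DataFrame.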

Answer 1

zer*_*323 5

Actually, in this particular case there is no need for a UDF. Instead, you can safely use the Column.cast method:

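import org.apache.spark.sql.types.IntegerType
import sqlContext.implicits._ // enables the $"Y" column syntax

// Cast Y to an integer column; the output below shows the unparseable string turning into null
val clean = x.withColumn("Y", $"Y".cast(IntegerType)) // or .cast("integer")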

clean.where($"Y".isNotNull).show
// +---+---+
// |  X|  Y|
// +---+---+
// | u1|  1|
// | u3|  5|
// | u4|  2|
// +---+---+

clean.where($"Y".isNull).show
// +---+----+
// |  X|   Y|
// +---+----+
// | u2|null|
// +---+----+

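If a UDF is still preferred (for instance because the parsing rule is more involved than a plain cast, or because ANSI mode is enabled, in which case cast raises an error on malformed input instead of returning null), the version from the question can probably be made to compile by returning Option[Int] rather than Int. The following is only a sketch, not part of the original answer: the helper name parseInt is hypothetical, and it assumes the DataFrame x from the question is in scope.

```scala
import scala.util.Try
import org.apache.spark.sql.functions.udf

// Hypothetical helper: Some(n) when the string parses as an Int, None otherwise.
// Try also catches the exception thrown for a null input, so null maps to None.
def parseInt(s: String): Option[Int] = Try(s.toInt).toOption

// Spark encodes Option[Int] as a nullable integer column: None becomes null.
val stringToInt = udf[Option[Int], String](s => parseInt(s))

// Assumes `x` is the DataFrame defined in the question.
val cleanedWithUdf = x.withColumn("Y", stringToInt(x("Y")))
```

With the sample data from the question, the row for u2 should end up with a null in Y, just as with cast. The cast approach is usually the better default, since built-in expressions are visible to Spark's optimizer while UDFs are opaque to it.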
