Spark treats null values in a csv column as null data type

Question

ttu*_*ner 5 apache-spark-sql spark-dataframe

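My Spark application reads a csv file, transforms it to a different format with SQL, and writes the resulting DataFrame to another csv file.

For example, my input csv is as follows: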

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       null as SecondaryLocationId
from Input

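(I can't say why null is used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the data type of SecondaryLocationId, so it reports null as the type in the schema and throws the error "CSV data source does not support null data type" while writing the output csv.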

Below are the printSchema() output and the write options I am using.

root
     |-- Id: string (nullable = true)
     |-- FirstName: string (nullable = true)
     |-- LastName: string (nullable = true)
     |-- PrimaryLocationId: string (nullable = false)
     |-- SecondaryLocationId: null (nullable = true)

dataFrame.repartition(1).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("delimiter", "|")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(outputPath)

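Is there a way to default to a data type (such as string)? By the way, I can get this to work by replacing null with an empty string (''), but that is not what I want to do.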

Answer 1

vaq*_*han 5

Use lit(null). Import org.apache.spark.sql.functions.{lit, udf}.

Example:

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))


scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

The problem is that the foobar column has the null type, and it is not retained by the csv writer. If it is a hard requirement, you can cast the column to a specific type (let's say String):

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))
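
The same cast can also be written directly in the SQL from the question, so that SecondaryLocationId gets a string type instead of the null type. The following is a minimal sketch, not part of the original answer: the local SparkSession, the app name, and the in-memory stand-in for the input file are assumptions for illustration, and it is meant to run as a script or in a Scala REPL with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("null-cast-sketch")
  .getOrCreate()
import spark.implicits._

// In-memory stand-in for the question's input csv, registered as the view "Input".
Seq(("1", "John", "Doe", "123"), ("2", "Alex", "Doe", "234"))
  .toDF("Id", "FirstName", "LastName", "LocationId")
  .createOrReplaceTempView("Input")

// CAST(NULL AS STRING) gives the column a concrete type while every value stays null.
val result = spark.sql(
  """SELECT Id,
    |       FirstName,
    |       LastName,
    |       LocationId AS PrimaryLocationId,
    |       CAST(NULL AS STRING) AS SecondaryLocationId
    |FROM Input""".stripMargin)

result.printSchema()
```

With a typed column, the write from the question should no longer hit the null data type error, and the nullValue option still controls how the nulls are rendered in the output file.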

Or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

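Reposting zero323's code.

Now let's discuss your second question.

Question:

"This only works when I know which columns will be treated as the null data type. When reading a large number of files and applying various transformations to them, I won't know which fields are treated as null, or is there a way to find out?"

Answer:

In this case you can use Option.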
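
One way to read this suggestion is a UDF that wraps a possibly-null input in Option and returns an Option, which Spark stores as a nullable column of a concrete type. The following is a minimal sketch and not code from the original answer: isEvenOptionUdf, the sample rows, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("option-udf-sketch")
  .getOrCreate()

// Hypothetical UDF: the boxed java.lang.Integer input may be null,
// Option(n) turns null into None, and None is written back as null.
val isEvenOptionUdf = udf((n: java.lang.Integer) => Option(n).map(_.intValue % 2 == 0))

// Sample data with an explicit schema so that the null row is allowed.
val schema = StructType(Seq(StructField("number", IntegerType, nullable = true)))
val rows = java.util.Arrays.asList(Row(1), Row(8), Row(12), Row(null))
val sourceDf = spark.createDataFrame(rows, schema)

val actualDf = sourceDf.withColumn("is_even", isEvenOptionUdf(col("number")))
actualDf.printSchema()
```

Because the UDF's return type is Option[Boolean], the is_even column is typed as boolean rather than null, even for rows where the value is null.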

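The Databricks Scala style guide does not agree that null should always be banned from Scala code and says: "For performance sensitive code, prefer null over Option, in order to avoid virtual method calls and boxing."

Example: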

+------+
|number|
+------+
|     1|
|     8|
|    12|
|  null|
+------+


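import org.apache.spark.sql.functions.{col, lit, when}

// isEvenSimpleUdf is a UDF defined elsewhere (its definition is not shown here).
// The isNotNull guard makes null inputs produce null in the new column.
val actualDf = sourceDf.withColumn(
  "is_even",
  when(
    col("number").isNotNull,
    isEvenSimpleUdf(col("number"))
  ).otherwise(lit(null))
)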

actualDf.show()
+------+-------+
|number|is_even|
+------+-------+
|     1|  false|
|     8|   true|
|    12|   true|
|  null|   null|
+------+-------+
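
For the case raised in the quoted question, where it is not known in advance which columns end up with the null type, one possible approach is to inspect the DataFrame schema and cast every NullType column before writing. The following is a minimal sketch under stated assumptions: castNullTypeColumns is a hypothetical helper that is not part of the original answer, it only looks at top-level columns, and the choice of StringType as the default is an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{NullType, StringType}

// Hypothetical helper: casts every top-level NullType column to string
// and leaves all other columns unchanged.
def castNullTypeColumns(df: DataFrame): DataFrame = {
  val columns = df.schema.fields.map { field =>
    // Backticks keep names that contain dots from being parsed as nested fields.
    val quoted = col(s"`${field.name}`")
    if (field.dataType == NullType) quoted.cast(StringType).as(field.name) else quoted
  }
  df.select(columns: _*)
}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("nulltype-scan-sketch")
  .getOrCreate()

// lit(null) without a cast produces a NullType column, as in the question.
val df = spark.range(2).withColumn("SecondaryLocationId", lit(null))
val fixed = castNullTypeColumns(df)
fixed.printSchema()
```

Applying such a helper just before the csv write means the transformations themselves do not need to know which columns were created from untyped nulls.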

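This only works when I know which columns will be treated as the null data type. When reading a large number of files and applying various transformations to them, I won't know which fields are treated as null, or is there a way to find out? (2 upvotes)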

Archived: 8 years, 5 months ago
Views: 4046
Last activity: 8 years, 5 months ago