Spark, Scala - determining a column's type

J.D*_*one 7 scala apache-spark

I load data from a database and do some processing on it. The problem is that some tables have the date column typed as 'String', while other tables have it as 'timestamp'.

Before loading the data, there is no way for me to know the type of the date column.

x.getAs[String]("date")    // fails when the date column is timestamp-typed
x.getAs[Timestamp]("date") // fails when the date column is string-typed
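I could work around it per row by matching on the runtime value instead of committing to a type up front, something like this (untested sketch, the helper name is my own):

import java.sql.Timestamp
import org.apache.spark.sql.Row

// Hypothetical helper: read the "date" field whatever its runtime type is
def dateAsString(r: Row): String =
  r.get(r.fieldIndex("date")) match {
    case t: Timestamp => t.toString
    case s: String    => s
    case other        => throw new IllegalArgumentException(s"unexpected date value: $other")
  }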

This is how I load the data with Spark:

spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()
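After the load I can at least inspect what type actually came back, e.g. (sketch, assuming the loaded frame is bound to df):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StringType, TimestampType}

// Check the runtime type of the "date" column after loading
def describeDateColumn(df: DataFrame): Unit =
  df.schema("date").dataType match {
    case StringType    => println("date is a String column")
    case TimestampType => println("date is a Timestamp column")
    case other         => println(s"date has type $other")
  }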

Is there any way to handle both cases with one code path? Or should I always convert the column to a string?

Tza*_*har 13

You can pattern-match on the column's type (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as-is, and use the unix_timestamp function for the actual conversion:

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

import spark.implicits._ // for toDF and the $ column syntax

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If column is String, converts it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has Timestamp type - 
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
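If you'd rather go the other way and always work with strings (the alternative raised in the question), the same schema match works in reverse - a minimal sketch using date_format for the Timestamp-to-String direction (the function name normalizeDateToString is my own):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

// If column is Timestamp, renders it as a "yyyy-MM-dd" String
def normalizeDateToString(df: DataFrame): DataFrame =
  df.schema("date").dataType match {
    case TimestampType => df.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
    case _             => df
  }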