Converting a date column to age using Scala and Spark

gio*_*sis -1 scala apache-spark

I am trying to convert a column of my dataset into actual ages. I am using Scala and Spark, and my project is in IntelliJ.

Here is a sample of the dataset:

TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
7000|1957-03-06|Female|3|Beauty
8000|1959-03-06|Male|4|Car 

Here is the Scala code:

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"

    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv(filePath)
      .select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")

    // Drop rows that have a null in any of the selected columns
    val df2 = df
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")

    df2.show()
  }
}

So I am trying to convert a value like 1957-03-06 in that column into an age like 61.

Any ideas would be a great help.

Thank you very much.

sta*_*106 5

You can use the built-in functions months_between() or datediff(). Check this out:

scala> val df = Seq("1957-03-06","1959-03-06").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]

scala> df.show(false)
+----------+
|date      |
+----------+
|1957-03-06|
|1959-03-06|
+----------+

scala> df.withColumn("age",months_between(current_date,'date)/12).show
+----------+------------------+
|      date|               age|
+----------+------------------+
|1957-03-06|61.806451612500005|
|1959-03-06|59.806451612500005|
+----------+------------------+

scala> df.withColumn("age",datediff(current_date,'date)/365).show
+----------+-----------------+
|      date|              age|
+----------+-----------------+
|1957-03-06|61.85205479452055|
|1959-03-06|59.85205479452055|
+----------+-----------------+

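Both approaches above give a fractional age. If an exact whole-year age is needed (one that only increments on the actual birthday), java.time can compute it directly; a minimal sketch, where the helper name ageInYears and the fixed reference date are illustrative, not from the original post:

```scala
import java.time.{LocalDate, Period}

// Whole-year age from an ISO-formatted birth date string, relative to a given date.
// Period.between counts complete years, so the age only increases on the birthday.
def ageInYears(birthDate: String, asOf: LocalDate): Int =
  Period.between(LocalDate.parse(birthDate), asOf).getYears

val asOf = LocalDate.of(2019, 1, 15) // fixed "today" so the results are reproducible
println(ageInYears("1957-03-06", asOf)) // 61 (the 2019 birthday has not happened yet)
println(ageInYears("1959-03-06", asOf)) // 59
```

This logic could also be wrapped in a Spark UDF to produce an integer age column, though the built-in months_between/datediff functions shown above avoid the UDF overhead.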