Anu*_*hit 4 scala apache-spark
I have a Spark DataFrame (df) with the following structure:
amount gender status
1000 male married
1313 female single
1000 male married
Basically, I want to create a new column in which gender is encoded as a number:
amount gender status gender_num
1000 male married 1
1313 female single 2
1000 male married 1
I tried the following:
val gender = df.gender
val gender_num = gender match {
case male => 1
case female => 2
}
I get the following error:
<console>:125: error: value pa_gender_category is not a member of org.apache.spark.sql.DataFrame
val gender = data.pa_gender_category
I know there is a stringtoindex function, but I want to do this manually.
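Use withColumn:
import org.apache.spark.sql.functions.when
import spark.implicits._ // enables the $"gender" syntax; assumes a SparkSession named spark
val input = df // load your input DataFrame here (df from the question)
val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).otherwise(1))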
You can chain more options (note that Column comparisons need ===, not ==):
val withGender = input.withColumn("gender_num", when($"gender" === "female", 2).when($"gender" === "other", 3).otherwise(1))
You can also use a UDF, as in Akash's answer. Note that UDFs sometimes cannot be optimized as well as built-in functions, but they can be more readable.
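For reference, here is a minimal sketch of what the UDF route could look like. It is not Akash's actual code. The object name `GenderUdfSketch`, the helper `genderToNum`, the local SparkSession, and the fallback code 0 are all assumptions made for illustration. It also shows why the `match` in the question cannot work even on a plain string: a bare lowercase `case male =>` is a variable pattern that matches anything, so string literals are needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object GenderUdfSketch {
  // Plain Scala mapping (hypothetical helper). The patterns are string
  // literals; a bare `case male =>` would bind a new variable and match
  // every input.
  def genderToNum(gender: String): Int = gender match {
    case "male"   => 1
    case "female" => 2
    case _        => 0 // assumed fallback for any other value, including null
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gender-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (1000, "male", "married"),
      (1313, "female", "single"),
      (1000, "male", "married")
    ).toDF("amount", "gender", "status")

    // Wrap the plain function as a UDF and apply it per row.
    val genderUdf = udf((g: String) => genderToNum(g))
    val withGender = df.withColumn("gender_num", genderUdf($"gender"))

    withGender.show()
    spark.stop()
  }
}
```

The mapping lives in an ordinary function, so it can be unit tested without Spark. The trade-off is the one mentioned above: Catalyst treats the UDF as a black box, whereas the when/otherwise version stays fully visible to the optimizer.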