Spark: add a new column with the same value in Scala

Ale*_*dro 11 scala apache-spark spark-dataframe

I'm having some trouble with withColumn in the Spark-Scala environment. I want to add a new column to my DataFrame, which looks like this:

+---+----+---+
|  A|   B|  C|
+---+----+---+
|  4|blah|  2|
|  2|    |  3|
| 56| foo|  3|
|100|null|  5|
+---+----+---+

so that it becomes:

+---+----+---+-----+
|  A|   B|  C|    D|
+---+----+---+-----+
|  4|blah|  2|  750|
|  2|    |  3|  750|
| 56| foo|  3|  750|
|100|null|  5|  750|
+---+----+---+-----+

That is, column D holds the same value, repeated for each of the N rows in my DataFrame.

The code is this:

var totVehicles : Double = df_totVehicles(0).getDouble(0); //return 750

The variable totVehicles returns the correct value, so this part works!

The second DataFrame has to compute 2 fields (id_zipcode, n_vehicles) and add a third column (with the same value, 750, on every row):

var df_nVehicles =
df_carPark.filter(
      substring($"id_time",1,4) < 2013
    ).groupBy(
      $"id_zipcode"
    ).agg(
      sum($"n_vehicles") as 'n_vehicles
    ).select(
      $"id_zipcode" as 'id_zipcode,
      'n_vehicles
    ).orderBy(
      'id_zipcode,
      'n_vehicles
    );

Finally, I add the new column with the withColumn function:

var df_nVehicles2 = df_nVehicles.withColumn(totVehicles, df_nVehicles("n_vehicles") + df_nVehicles("id_zipcode"))

But Spark gives me this error:

 error: value withColumn is not a member of Unit
         var df_nVehicles2 = df_nVehicles.withColumn(totVehicles, df_nVehicles("n_vehicles") + df_nVehicles("id_zipcode"))

Can you help me? Thank you very much!

Roc*_*ang 34

The lit function is used to add a literal value as a column:

import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))
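Applied to the question's variables, the same idea might look like the sketch below. It is only an illustration under stated assumptions: it uses SparkSession (Spark 2.x or later, which the question may predate), the sample rows are made up, and the object name AddConstantColumn and the helper withTotal are hypothetical. Note that the first argument of withColumn is the new column's name as a String, not the Double value itself, and that show() is called on its own line instead of being chained onto the assignment.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object AddConstantColumn {
  // Hypothetical helper: appends a column named "totVehicles" that holds
  // the same literal value on every row.
  def withTotal(df: DataFrame, totVehicles: Double): DataFrame =
    df.withColumn("totVehicles", lit(totVehicles))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session created only for this demonstration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("AddConstantColumn")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for df_nVehicles from the question; these rows are invented.
    val dfNVehicles = Seq(("10001", 120.0), ("10002", 630.0))
      .toDF("id_zipcode", "n_vehicles")

    // Stand-in for the value read from df_totVehicles in the question.
    val totVehicles: Double = 750.0

    val dfNVehicles2 = withTotal(dfNVehicles, totVehicles)

    // show() returns Unit, so it is kept separate from the assignment above.
    dfNVehicles2.show()

    spark.stop()
  }
}
```

One possible cause of the "not a member of Unit" error is that df_nVehicles was assigned the result of an action such as show(), which returns Unit instead of a DataFrame. This is a guess, since the posted code does not include such a call.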