How to aggregate multiple columns at once in Spark

Question

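I have a DataFrame with many columns. I want to group by one of the columns and aggregate all the other columns at once. Say the table has four columns, cust_id, f1, f2, and f3; I want to group by cust_id and then get avg(f1), avg(f2), and avg(f3). The real table will have many columns. Any hints?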

The code below is a good start, but since I have many columns, writing them all out by hand is probably not a good idea.

df.groupBy("cust_id").agg(sum("f1"), sum("f2"), sum("f3"))

Answer 1

Dan*_*ula 7

You could try mapping over the list of column names:

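import org.apache.spark.sql.functions.avg

val groupCol = "cust_id"
// Build one avg(...) column per non-grouping column.
// Note: a Set does not guarantee that the original column order is kept.
val aggCols = (df.columns.toSet - groupCol).map(
  colName => avg(colName).as(colName + "_avg")
).toList

// agg takes a first Column plus varargs, hence the head/tail split.
df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)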
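
To see the column-mapping approach end to end, here is a minimal sketch. The local SparkSession settings and the sample rows are assumptions made for illustration; only the column names cust_id, f1, f2, f3 come from the question. It uses filter instead of a Set so the output columns keep their original order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("multi-agg-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data using the question's column names.
val df = Seq(
  (1, 1.0, 10.0, 100.0),
  (1, 3.0, 30.0, 300.0),
  (2, 5.0, 50.0, 500.0)
).toDF("cust_id", "f1", "f2", "f3")

val groupCol = "cust_id"

// filter keeps the original column order in the result.
val aggCols = df.columns.toList
  .filter(_ != groupCol)
  .map(c => avg(c).as(c + "_avg"))

// agg expects one Column followed by varargs, so split into head and tail.
val result = df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)

// Resulting columns: cust_id, f1_avg, f2_avg, f3_avg
result.orderBy(groupCol).show()
```

Note that aggCols.head fails on an empty list, so this assumes the DataFrame has at least one column besides the grouping column.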

Alternatively, if needed, you can match on the schema and build the aggregations by column type:

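import org.apache.spark.sql.functions.{avg, first}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

// Columns of other types are skipped; an IntegerType grouping column would also match.
val aggCols = df.schema.collect {
  case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
  case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first")
}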
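
The schema-based snippet stops before the groupBy call, so here is one possible way to complete it. This is a sketch under assumptions: the local SparkSession, the sample rows, and the label column are hypothetical. It also drops the grouping column before matching, since cust_id would otherwise be averaged as an IntegerType column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, first}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("schema-agg-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: one Int feature and one String label per row.
val df = Seq(
  (1, 10, "a"),
  (1, 30, "a"),
  (2, 50, "b")
).toDF("cust_id", "f1", "label")

val groupCol = "cust_id"

// Skip the grouping column, then pick an aggregation per column type.
// Columns whose type matches no case are left out of the result.
val aggCols = df.schema.toList
  .filter(_.name != groupCol)
  .collect {
    case StructField(name, IntegerType, _, _) => avg(name).as(name + "_avg")
    case StructField(name, StringType, _, _)  => first(name).as(name + "_first")
  }

val result = df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)

// Resulting columns: cust_id, f1_avg, label_first
result.orderBy(groupCol).show()
```

Keep in mind that first returns whichever value Spark encounters first in each group, which is not deterministic unless the data is ordered; the sample data sidesteps this by using a single label per customer.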
