I currently have code that applies the same transformation to multiple DataFrame columns through a chain of repeated .withColumn calls, and I'd like to write a function that streamlines the process. In my case, I'm computing cumulative sums of columns, aggregated by key:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val newDF = oldDF
.withColumn("cumA", sum("A").over(Window.partitionBy("ID").orderBy("time")))
.withColumn("cumB", sum("B").over(Window.partitionBy("ID").orderBy("time")))
.withColumn("cumC", sum("C").over(Window.partitionBy("ID").orderBy("time")))
//.withColumn(...)
What I want is something like:
def createCumulativeColumns(cols: Array[String], df: DataFrame): DataFrame = {
// Implement the above cumulative sums, partitioning, and ordering
}
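A minimal sketch of one way to implement this: fold over the column names with foldLeft so each step threads the accumulated DataFrame through one withColumn call. The "ID"/"time" window and the "cum" name prefix are carried over from the example above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

def createCumulativeColumns(cols: Array[String], df: DataFrame): DataFrame = {
  // One shared window spec: running totals per ID, ordered by time
  val w = Window.partitionBy("ID").orderBy("time")
  // Each fold step adds one cumulative column to the accumulated DataFrame
  cols.foldLeft(df) { (acc, c) =>
    acc.withColumn(s"cum$c", sum(c).over(w))
  }
}

// e.g. val newDF = createCumulativeColumns(Array("A", "B", "C"), oldDF)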
Or better yet:
def withColumns(cols: Array[String], df: DataFrame, f: Column => Column): DataFrame = {
// Apply an arbitrary Column => Column function to all the specified columns
}
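A minimal sketch under the same assumptions: f is an ordinary Column => Column function (window expressions included), and the "cum" prefix for the new column names is kept from the example for illustration; a real version might take the naming scheme as a parameter as well:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

def withColumns(cols: Array[String], df: DataFrame, f: Column => Column): DataFrame =
  cols.foldLeft(df) { (acc, c) =>
    // "cum" prefix is assumed here; parametrize the naming as needed
    acc.withColumn(s"cum$c", f(col(c)))
  }

// e.g., reproducing the cumulative sums above:
// val w = Window.partitionBy("ID").orderBy("time")
// val newDF = withColumns(Array("A", "B", "C"), oldDF, c => sum(c).over(w))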
Tags: scala, user-defined-functions, dataframe, apache-spark, apache-spark-sql