How do I calculate the sum and count in a single groupBy?

ulr*_*ich 14 scala apache-spark apache-spark-sql

Given the following DataFrame:

val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
|  1|    A|  10|
|  2|    A|   5|
|  3|    B|  56|
+---+-----+----+

I want to get the number of IDs and the total amount per category:

+-----+-----+---------+
|Categ|count|sum(Amnt)|
+-----+-----+---------+
|    B|    1|       56|
|    A|    2|       15|
+-----+-----+---------+

Is it possible to compute the count and the sum without doing a join?

client.groupBy("Categ").count
      .join(client.withColumnRenamed("Categ","cat")
           .groupBy("cat")
           .sum("Amnt"), 'Categ === 'cat)
      .drop("cat")

Maybe something like this:

client.createOrReplaceTempView("client")
spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()

Ram*_*ram 21

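I'm starting with an example different from yours.

Several aggregate functions can be applied in one groupBy like this; adapt it to your case: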

// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))

// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))

Applied to your example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession
  .builder.master("local")
  .appName("MyGroup")
  .getOrCreate()
import spark.implicits._

val client: DataFrame = spark.sparkContext.parallelize(
  Seq((1,"A",10),(2,"A",5),(3,"B",56))
).toDF("ID","Categ","Amnt")

client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    B|       56|        1|
|    A|       15|        2|
+-----+---------+---------+
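
If you want the column names and order from the question's expected output (count first, named `count`), a small variation along these lines should do it. This is only a sketch: the app name `GroupAgg` and the value name `result` are placeholders I made up, and it assumes Spark 2.x or later, run in spark-shell or as a Scala script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

// Assumed local session; "GroupAgg" is an arbitrary app name.
val spark = SparkSession.builder.master("local[*]").appName("GroupAgg").getOrCreate()
import spark.implicits._

val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

// Both aggregates are computed in the same groupBy, so no join is needed.
val result = client
  .groupBy("Categ")
  .agg(
    count("ID").as("count"), // counts non-null IDs per category, renamed to "count"
    sum("Amnt")              // keeps the default column name "sum(Amnt)"
  )

result.show()
```

Row order in the output is not guaranteed after a groupBy; add an `orderBy("Categ")` if you need it stable.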

  • Dear downvoters! If something needs improving, please mention the reason. (3 upvotes)

小智 9

You can aggregate on the given table as follows:

client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()

+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    A|       15|        2|
|    B|       56|        1|
+-----+---------+---------+
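
The same aggregation can also be written in Spark SQL, as the question was attempting; the select list just needs commas between the expressions. A minimal sketch, assuming Spark 2.x or later in spark-shell or a Scala script. The aliases `cnt` and `total`, the value name `viaSql`, and the app name are illustrative choices of mine, not anything from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; "GroupAggSql" is an arbitrary app name.
val spark = SparkSession.builder.master("local[*]").appName("GroupAggSql").getOrCreate()
import spark.implicits._

val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

// Register the DataFrame so it can be referenced by name in SQL.
client.createOrReplaceTempView("client")

// One GROUP BY yields both aggregates; note the commas in the select list.
val viaSql = spark.sql(
  """SELECT Categ, count(ID) AS cnt, sum(Amnt) AS total
    |FROM client
    |GROUP BY Categ""".stripMargin)

viaSql.show()
```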