How to use GROUPING SETS as an operator/method on a Dataset?

Jih*_* No 5 dataframe apache-spark apache-spark-sql

Is there no function-level grouping_sets support in Spark Scala?

I don't know whether this patch has been applied to master: https://github.com/apache/spark/pull/5080

I want to run this kind of query through the Scala DataFrame API.

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?

Jac*_*ski 7

I want to run this kind of query through the Scala DataFrame API.

tl;dr As of Spark 2.1.0, this is not possible. There are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

  • rollup operator
  • cube operator
  • GROUPING SETS clause (SQL mode only)
  • grouping() and grouping_id() functions

Note: GROUPING SETS is available in SQL mode only. It is not supported in the Dataset API.
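
For comparison, the rollup operator and the grouping() and grouping_id() functions from the list above can be called directly on a DataFrame. The following is a minimal sketch, assuming a local SparkSession and a small sales dataset; the aliases year_is_aggregated and gid are illustrative names only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

// Assumption: a local session; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// rollup("city", "year") aggregates over (city, year), (city) and ().
val rolled = sales
  .rollup("city", "year")
  .agg(
    sum("amount") as "amount",
    // 1 when "year" was aggregated away in this row, 0 otherwise
    grouping("year") as "year_is_aggregated",
    // bit vector over (city, year): 0 = (city, year), 1 = (city), 3 = ()
    grouping_id() as "gid"
  )

rolled.show()
```

The gid column makes it possible to tell a subtotal row apart from a row whose source value was actually null.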

GROUPING SETS

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|null|   550|  <-- grand total across all cities and years
+-------+----+------+

// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), (year), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|2015|    50|  <-- total across all cities in 2015
|   null|2016|   250|  <-- total across all cities in 2016
|   null|2017|   250|  <-- total across all cities in 2017
|   null|null|   550|
+-------+----+------+
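
If the query has to stay in the Dataset API, one possible workaround is to compute a cube and keep only the rows that belong to the grouping sets of interest, using grouping_id() to identify them. This is only a sketch under that assumption, not a built-in equivalent of GROUPING SETS: it computes the whole cube first, so it may do more work than the SQL clause. The example emulates GROUPING SETS ((city, year), (year)); the names wanted and gid are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping_id, sum}

// Assumption: a local session; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("grouping-sets-workaround").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// grouping_id() over cube("city", "year"):
//   0 = (city, year), 1 = (city), 2 = (year), 3 = ()
val wanted = Seq(0, 2) // keep (city, year) and (year) only

val q = sales
  .cube("city", "year")
  .agg(sum("amount") as "amount", grouping_id() as "gid")
  .where($"gid".isin(wanted: _*))
  .drop("gid")

q.show()
```

For anything larger than a couple of grouping columns, registering a temporary view and using the SQL GROUPING SETS clause as shown above is likely the simpler option.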