Eri*_*ner 33 | tags: sql, rollup, cube, apache-spark, apache-spark-sql
The question is pretty much in the title. I can't find any detailed documentation on the differences.
I noticed a difference because I get different results when I swap the cube and groupBy calls. With cube, I get a lot of null values in the expressions I group by.
zer*_*323 69
These do not work the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words,
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies the aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+
and you compute cube(x, y) with count as the aggregate:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
A function similar to cube is rollup, which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+
Just for comparison, let's see the result of a plain groupBy:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+
To summarize:
With a plain GROUP BY, every row is included in its corresponding summary exactly once. With GROUP BY CUBE(..), every row is included in the summary of every combination of levels it represents, wildcards included. Logically, what's shown above is equivalent to this (assuming we could use NULL placeholders):
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
GROUP BY ROLLUP(...) is similar to CUBE, but works hierarchically, filling columns from left to right:
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and its documentation is relatively good.
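Spark isn't actually needed to see the logic: the UNION ALL expansions above can be sanity-checked with a small plain-Python sketch over the same four rows (the helper names cube_counts and rollup_counts are made up for illustration, with None playing the role of the NULL placeholder):

```python
from itertools import combinations
from collections import Counter

rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def cube_counts(rows, n):
    # CUBE: group by every subset of the n columns; None marks a
    # column that is not part of the grouping (the "wildcard" level).
    result = Counter()
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            for row in rows:
                key = tuple(row[i] if i in subset else None for i in range(n))
                result[key] += 1
    return result

def rollup_counts(rows, n):
    # ROLLUP: only the hierarchical prefixes (), (x), (x, y).
    result = Counter()
    for r in range(n + 1):
        for row in rows:
            key = tuple(row[i] if i < r else None for i in range(n))
            result[key] += 1
    return result

print(cube_counts(rows, 2)[(None, None)])   # 4  <- total count
print(cube_counts(rows, 2)[("foo", None)])  # 2  <- x = foo
print(cube_counts(rows, 2)[(None, 2)])      # 3  <- y = 2
print(rollup_counts(rows, 2)[(None, 2)])    # 0  <- (y) level absent in ROLLUP
```

The counts match the cube and rollup tables above, and the (None, 2) key shows the one grouping that CUBE produces but ROLLUP does not.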
There is one more member of this "family" that explains it all: GROUPING SETS. We don't have it in PySpark/Scala, but it exists in the SQL API.
GROUPING SETS lets you design whatever combination of groupings you need. The others (cube, rollup, groupBy) return fixed, predefined combinations:
cube("id", "x", "y") returns (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y).
(All possible combinations.)
rollup("id", "x", "y") returns only (), (id), (id, x), (id, x, y).
(Only the prefixes of the provided sequence.)
groupBy("id", "x", "y") returns only the (id, x, y) combination.
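The three families of grouping sets can be generated mechanically; a short plain-Python sketch with itertools (variable names are made up for illustration):

```python
from itertools import combinations

cols = ["id", "x", "y"]

# cube: every subset of the columns (the power set)
cube_sets = [c for r in range(len(cols) + 1) for c in combinations(cols, r)]

# rollup: only the left-to-right prefixes
rollup_sets = [tuple(cols[:r]) for r in range(len(cols) + 1)]

# groupBy: just the full column list
groupby_sets = [tuple(cols)]

print(len(cube_sets))  # 8
print(rollup_sets)     # [(), ('id',), ('id', 'x'), ('id', 'x', 'y')]
print(groupby_sets)    # [('id', 'x', 'y')]
```

Note that (x, y) appears in cube_sets but not in rollup_sets, which is exactly why rollup produces fewer rows below.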
Example
Input df:
df = spark.createDataFrame(
[("a", "foo", 1),
("a", "foo", 2),
("a", "bar", 2),
("a", "bar", 2)],
["id", "x", "y"])
df.createOrReplaceTempView("df")
cube
df.cube("id", "x", "y").count()
is the same as...
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
(x),
(y),
(id, x),
(id, y),
(x, y),
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null| 2| 3|
|null|null|null| 4|
| a|null| 2| 3|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null| 1| 1|
|null| foo|null| 2|
| a|null|null| 4|
|null|null| 1| 1|
|null| foo| 2| 1|
|null| foo| 1| 1|
| a| foo| 2| 1|
|null| bar|null| 2|
|null| bar| 2| 2|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
rollup
df.rollup("id", "x", "y").count()
is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
--(x), <- (not used)
--(y), <- (not used)
(id, x),
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null|null| 4|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null|null| 4|
| a| foo| 2| 1|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
groupBy
df.groupBy("id", "x", "y").count()
is the same as... GROUPING SETS ((id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
--(), <- (not used)
--(id), <- (not used)
--(x), <- (not used)
--(y), <- (not used)
--(id, x), <- (not used)
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+---+---+---+-----+
| id| x| y|count|
+---+---+---+-----+
| a|foo| 2| 1|
| a|foo| 1| 1|
| a|bar| 2| 2|
+---+---+---+-----+
Note: all of the above return only combinations that exist in the data. In the example dataframe there is no row with "id":"a", "x":"bar", "y":1, and even cube does not return it. To get all possible combinations (whether present or not), we should do something like the following (crossJoin):
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols:=["id", "x", "y"]):
df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (df_cartesian.drop("_tmp")
.join(df.cube(*cols).count(), cols, 'full')
)
df_final.show()
# +----+----+----+-----+
# | id| x| y|count|
# +----+----+----+-----+
# |null|null|null| 4|
# |null|null| 1| 1|
# |null|null| 2| 3|
# |null| bar|null| 2|
# |null| bar| 2| 2|
# |null| foo|null| 2|
# |null| foo| 1| 1|
# |null| foo| 2| 1|
# | a|null|null| 4|
# | a|null| 1| 1|
# | a|null| 2| 3|
# | a| bar|null| 2|
# | a| bar| 1| null|
# | a| bar| 2| 2|
# | a| foo|null| 2|
# | a| foo| 1| 1|
# | a| foo| 2| 1|
# +----+----+----+-----+
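The same completion can be illustrated without Spark; a plain-Python sketch (names are made up, and it fills in only the fully concrete combinations, not the None wildcard rows) over the same four-row data:

```python
from itertools import product
from collections import Counter

rows = [("a", "foo", 1), ("a", "foo", 2), ("a", "bar", 2), ("a", "bar", 2)]

# cube-style counts: each row contributes to every key in which each
# column is either kept or replaced by None (the wildcard level)
cube = Counter()
for row in rows:
    for mask in product([True, False], repeat=3):
        key = tuple(v if keep else None for v, keep in zip(row, mask))
        cube[key] += 1

# distinct values per column build the full cartesian grid; combinations
# absent from the data get None instead of a count, like the full join
levels = [sorted({r[i] for r in rows}) for i in range(3)]
grid = {combo: cube.get(combo) for combo in product(*levels)}

print(grid[("a", "bar", 1)])  # None <- combination absent from the data
print(grid[("a", "bar", 2)])  # 2
```

This mirrors the crossJoin + full outer join above: ("a", "bar", 1) appears in the grid with a null count even though cube alone would never emit it.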