Yoo*_*hoe 6 scala cube dataframe apache-spark apache-spark-sql
I am using Spark 1.6.1, and I have a DataFrame like this:
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
| scene_id| action_id| classifier|os_name|country|app_ver| p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
| test_home|scene_enter| test_home|android| KR| 5.6.3|__OTHERS__| false| test| test| test|
......
I would like to get a DataFrame like the one below by using the cube operation
(grouped by all fields, but with only the "os_name", "country", and "app_ver" fields cubed):
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
| scene_id| action_id| classifier|os_name|country|app_ver| p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
| test_home|scene_enter| test_home|android| KR| 5.6.3|__OTHERS__| false| test| test| test| 9|
| test_home|scene_enter| test_home| null| KR| 5.6.3|__OTHERS__| false| test| test| test| 35|
| test_home|scene_enter| test_home|android| null| 5.6.3|__OTHERS__| false| test| test| test| 98|
| test_home|scene_enter| test_home|android| KR| null|__OTHERS__| false| test| test| test|101|
| test_home|scene_enter| test_home| null| null| 5.6.3|__OTHERS__| false| test| test| test|301|
| test_home|scene_enter| test_home| null| KR| null|__OTHERS__| false| test| test| test|225|
| test_home|scene_enter| test_home|android| null| null|__OTHERS__| false| test| test| test|312|
| test_home|scene_enter| test_home| null| null| null|__OTHERS__| false| test| test| test|521|
......
I have tried the following, but it looks slow and ugly:
var cubed = df
.cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver", $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
.count
.where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")
Is there a better solution?
I believe you cannot avoid the problem completely, but there is a simple trick that reduces its scale. The idea is to replace all the columns which shouldn't be marginalized with a single placeholder.
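To put a number on the reduction (a rough sketch, using the column counts from the question above): cube over n columns enumerates 2^n grouping sets, so collapsing the group-by-only columns into one placeholder shrinks the work dramatically.

```scala
// `cube` over n columns enumerates 2^n grouping sets.
def groupingSets(n: Int): Long = 1L << n

// Naive approach from the question: cube all 11 columns.
val naive = groupingSets(11)       // 2048 grouping sets

// With the trick, the 8 group-by-only columns collapse into one
// placeholder, so we cube only 4 columns (rest, os_name, country, app_ver).
val collapsed = groupingSets(4)    // 16 grouping sets

// Filtering `rest IS NOT NULL` afterwards keeps only the grouping sets
// that include the placeholder: half of them.
val kept = collapsed / 2           // 8 grouping sets

println(s"naive=$naive collapsed=$collapsed kept=$kept")
```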
For example, if you have a DataFrame:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")
and you are interested in a cube marginalized by d and e and grouped by a..c, you can define the substitute for a..c as:
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._
// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")
and then cube:
val cubed = Seq($"d", $"e")
// If there is a problem with aliasing rest, it can be done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count
A quick filter and select should handle the rest:
tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)
with the result:

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+
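For intuition, that output can be reproduced without Spark. The sketch below (a hypothetical helper cubeDE, plain Scala collections) emulates cubing d and e while grouping by a fixed (a, b, c) key: each row contributes one entry per subset of the cubed columns, which is exactly what survives the rest IS NOT NULL filter.

```scala
// One toy row, mirroring the answer's example (column f omitted; it is
// not used by the cube or the final select).
case class Row(a: Int, b: Int, c: Int, d: Int, e: Int)
val rows = Seq(Row(1, 2, 3, 4, 5))

// Stand-in for `cube($"d", $"e")` grouped by (a, b, c): for every row,
// emit one key per subset of the cubed columns, with the columns outside
// the subset replaced by None (Spark's null), then count per key.
def cubeDE(rows: Seq[Row]): Map[(Int, Int, Int, Option[Int], Option[Int]), Int] =
  rows
    .flatMap { r =>
      for {
        d <- Seq(Some(r.d), None)
        e <- Seq(Some(r.e), None)
      } yield (r.a, r.b, r.c, d, e)
    }
    .groupBy(identity)
    .map { case (k, v) => k -> v.size }

val counts = cubeDE(rows)
// Four grouping sets, each counting the single input row once.
```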