Dzm*_*kov 12 scala apache-spark apache-spark-sql apache-spark-1.6
In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
Ram*_*jan 22
Given that you have a dataframe like
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
you can use a Window as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
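Adding orderBy("colB") gives the window a cumulative frame (from the start of the partition up to the current row), which is why each row's list grows. A plain-Scala sketch of that semantics (no Spark; the Row case class and sample data here are just illustrations):

```scala
// Emulate collect_list over Window.partitionBy("colA").orderBy("colB"):
// within each colA partition, a row's colD is the list of colC values
// from the first row up to and including the current row.
case class Row(colA: Int, colB: Int, colC: Int)

val rows = Seq(Row(1, 1, 23), Row(1, 2, 63), Row(1, 3, 31),
               Row(2, 1, 32), Row(2, 2, 56))

val withColD: Seq[(Row, List[Int])] =
  rows.groupBy(_.colA).toSeq.sortBy(_._1).flatMap { case (_, part) =>
    val sorted = part.sortBy(_.colB)          // orderBy("colB")
    sorted.zipWithIndex.map { case (r, i) =>
      (r, sorted.take(i + 1).map(_.colC).toList)  // cumulative frame
    }
  }

withColD.foreach { case (r, d) => println(s"${r.colA} ${r.colB} ${r.colC} $d") }
```

This mirrors the running lists in the table above, e.g. row (1, 3, 31) gets [23, 63, 31].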
The result is similar for collect_set, but the order of the elements in the final set will not match that of collect_list.
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
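The cumulative frame works the same way for collect_set; the only differences are that duplicates are dropped and Spark makes no guarantee about element order inside the set. A hedged plain-Scala sketch (again no Spark, hypothetical sample data):

```scala
// Emulate collect_set over the same ordered window: cumulative frame,
// but the accumulated values are deduplicated into an unordered Set.
case class Row(colA: Int, colB: Int, colC: Int)

val rows = Seq(Row(1, 1, 23), Row(1, 2, 63), Row(1, 3, 31),
               Row(2, 1, 32), Row(2, 2, 56))

val withSet: Seq[(Row, Set[Int])] =
  rows.groupBy(_.colA).toSeq.flatMap { case (_, part) =>
    val sorted = part.sortBy(_.colB)
    sorted.zipWithIndex.map { case (r, i) =>
      (r, sorted.take(i + 1).map(_.colC).toSet)  // dedup, no order guarantee
    }
  }
```

Because a Set has no defined iteration order, only the membership of colD is predictable, not its printed order.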
If you remove the orderBy, as below,
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
the result will be
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
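Without orderBy, the frame covers the entire partition, so every row in a partition sees the same full list. A minimal plain-Scala sketch of that, assuming the same sample data as above:

```scala
// Emulate collect_list over Window.partitionBy("colA") with no orderBy:
// one list per colA value; every row of that partition gets the whole list.
case class Row(colA: Int, colB: Int, colC: Int)

val rows = Seq(Row(1, 1, 23), Row(1, 2, 63), Row(1, 3, 31),
               Row(2, 1, 32), Row(2, 2, 56))

val perPartition: Map[Int, List[Int]] =
  rows.groupBy(_.colA).map { case (a, part) => a -> part.map(_.colC).toList }
```

Note that without an ordering clause the order of elements within each list is also not guaranteed by Spark.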
I hope the answer is helpful.
Viewed: 26,532 times