Inverse of the explode function

Question

tic*_*pix 0 scala apache-spark apache-spark-sql

In Scala with Spark 2.4, I would like to filter the values inside an array column.

From

+---+------------+
| id|      letter|
+---+------------+
|  1|[x, xxx, xx]|
|  2|[yy, y, yyy]|
+---+------------+

to

+---+-------+
| id| letter|
+---+-------+
|  1|[x, xx]|
|  2|[yy, y]|
+---+-------+

I thought of using explode + filter:

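// Assumes a SparkSession with `import spark.implicits._` in scope, as in spark-shell
import org.apache.spark.sql.functions.{col, explode, length}
val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")
res.withColumn("tmp", explode(col("letter"))).filter(length(col("tmp")) < 3).drop(col("letter")).show()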

I get:

+---+---+
| id|tmp|
+---+---+
|  1|  x|
|  1| xx|
|  2| yy|
|  2|  y|
+---+---+

How can I zip / groupBy the rows back together by id?

Or is there a better, more optimized solution?

Answer 1

小智 6

With Spark 2.4 you can filter the array without explode(), using the filter higher-order function:

res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show()

Output:

+---+-------+
| id| letter|
+---+-------+
|  1|[x, xx]|
|  2|[yy, y]|
+---+-------+
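
For the literal question (how to group the exploded rows back by id), one possible approach is groupBy followed by collect_list. The following is a minimal sketch rather than a definitive implementation: the local SparkSession setup and the name `regrouped` are assumptions made for illustration. Two caveats apply: collect_list does not guarantee element order after a shuffle, and any id whose elements are all filtered out disappears from the result instead of keeping an empty array.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, explode, length}

// Assumption: a local session created just for this sketch
val spark = SparkSession.builder().master("local[*]").appName("explode-inverse").getOrCreate()
import spark.implicits._

val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")

val regrouped = res
  .withColumn("tmp", explode(col("letter")))  // one row per array element
  .filter(length(col("tmp")) < 3)             // keep only the short strings
  .groupBy("id")                              // regroup the surviving rows by id
  .agg(collect_list("tmp").as("letter"))      // gather them back into an array

regrouped.show()
```

This route needs a shuffle for the groupBy, so the filter expression above is usually the cheaper option when the goal is only to drop array elements.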

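
If upgrading is an option, Spark 3.0 and later also expose filter as a Scala function in org.apache.spark.sql.functions, so the predicate can be written as a Column lambda instead of a SQL string. This does not work on Spark 2.4, which the question targets. A minimal sketch, assuming Spark 3.x and a local SparkSession created only for illustration (the name `filtered` is likewise hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter, length}

// Assumption: Spark 3.0+ and a local session created just for this sketch
val spark = SparkSession.builder().master("local[*]").appName("array-filter").getOrCreate()
import spark.implicits._

val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")

// Keep only the array elements shorter than 3 characters; element order is preserved
val filtered = res.withColumn("letter", filter(col("letter"), x => length(x) < 3))

filtered.show()
```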
Archived:	6 years, 7 months ago
Views:	90
Last activity:	6 years, 7 months ago