How to efficiently find distinct values from each column in Spark

Question

How to efficiently find distinct values from each column in Spark

Akh*_*laV 5 performance scala apache-spark

To find the distinct values in each column of an Array, I tried:

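// rdd: RDD[Array[String]], one array element per column (the name rdd is a placeholder)
rdd.map(_.map(Set(_)))  // wrap each value in a singleton Set
  .reduce { (a, b) => a.zip(b).map { case (x, y) => x ++ y } }  // union the sets column by column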


This runs successfully. However, I would like to know whether there is a more efficient way than the sample code above. Thanks.
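To make the intended result concrete, here is a minimal sketch of the same map-then-reduce idea on plain Scala collections, with no Spark involved. The object name, the helper names, and the sample rows are made up for illustration, and the sketch assumes the input is non-empty and every row has the same number of columns.

```scala
object DistinctByReduce {
  // Column-wise union of two rows of sets, as in the reduce step of the question.
  def unionColumns(a: Array[Set[String]], b: Array[Set[String]]): Array[Set[String]] =
    a.zip(b).map { case (x, y) => x ++ y }

  // Local stand-in for rdd.map(_.map(Set(_))).reduce(...).
  // Assumes rows is non-empty and all rows have the same length.
  def distinctPerColumn(rows: Seq[Array[String]]): Array[Set[String]] =
    rows
      .map(_.map(Set(_)))                     // wrap each value in a singleton Set
      .reduce((a, b) => unionColumns(a, b))   // merge rows column by column

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array("a", "x"), Array("b", "x"), Array("a", "y"))
    distinctPerColumn(rows).zipWithIndex.foreach { case (values, i) =>
      println(s"column $i: ${values.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

For the three sample rows, column 0 ends up with the values a and b, and column 1 with x and y. Note that every input value is first wrapped in its own one-element Set, which is the allocation the question is worried about.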

Answer 1

The*_*aul 4

aggregate saves a step (no separate map that wraps each value in a Set); it may or may not be more efficient:

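val z = Array.fill(5)(Set[String]()) // zero value: one empty Set per column (use the actual column count instead of 5)
val d = lists.aggregate(z)(
  (a, b) => a.zip(b).map { case (x, y) => x + y },   // seqOp: add each value of a row to its column's set
  (a, b) => a.zip(b).map { case (x, y) => x ++ y })  // combOp: union the per-column sets of two partitions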

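If the roles of the two functions passed to aggregate are unclear, the following sketch imitates them on local collections: the first function folds rows into the per-column sets within one partition, and the second merges the results of different partitions. This is only an approximation of what Spark does. The nested Seq standing in for partitions, the object name, and aggregateLocally are all hypothetical.

```scala
object AggregateSketch {
  type Acc = Array[Set[String]]

  // seqOp: fold one row into the per-column sets.
  def seqOp(acc: Acc, row: Array[String]): Acc =
    acc.zip(row).map { case (set, value) => set + value }

  // combOp: merge the per-column sets of two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    a.zip(b).map { case (x, y) => x ++ y }

  // Hypothetical local stand-in for RDD.aggregate:
  // each inner Seq plays the role of one partition.
  def aggregateLocally(partitions: Seq[Seq[Array[String]]], numColumns: Int): Acc = {
    def zero: Acc = Array.fill(numColumns)(Set.empty[String])
    partitions
      .map(_.foldLeft(zero)(seqOp))  // per-partition fold with seqOp
      .foldLeft(zero)(combOp)        // merge the partition results with combOp
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Array("a", "x"), Array("b", "x")),
      Seq(Array("a", "y"))
    )
    aggregateLocally(partitions, numColumns = 2).foreach(println)
  }
}
```

Unlike the map-then-reduce version, no singleton Set is created per value here; each row's values go straight into the accumulator for their column.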

You can also try using mutable sets and modifying them in place instead of creating a new set at every step (Spark explicitly allows this):

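val z = Array.fill(5)(scala.collection.mutable.Set[String]()) // one empty mutable Set per column
val d = lists.aggregate(z)(
  (a, b) => { a.zip(b).foreach { case (x, y) => x += y }; a },   // seqOp: add the row's values in place, return a
  (a, b) => { a.zip(b).foreach { case (x, y) => x ++= y }; a })  // combOp: merge b's sets into a in place, return a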

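The in-place variant can be tried out locally in the same way. The sketch below is again only an imitation on plain collections, with hypothetical names (MutableAggregateSketch, aggregateLocally) and a nested Seq standing in for partitions. It builds a fresh zero value for every fold so that no two folds share mutable state; that is an assumption of this local simulation, not a statement about Spark's internals.

```scala
import scala.collection.mutable

object MutableAggregateSketch {
  type Acc = Array[mutable.Set[String]]

  // seqOp: add the row's values to the existing sets and return the same array.
  def seqOp(acc: Acc, row: Array[String]): Acc = {
    acc.zip(row).foreach { case (set, value) => set += value }
    acc
  }

  // combOp: merge b's sets into a's sets in place and return a.
  def combOp(a: Acc, b: Acc): Acc = {
    a.zip(b).foreach { case (x, y) => x ++= y }
    a
  }

  // Hypothetical local stand-in for RDD.aggregate:
  // each inner Seq plays the role of one partition.
  def aggregateLocally(partitions: Seq[Seq[Array[String]]], numColumns: Int): Acc = {
    // A fresh zero for every fold, so the folds never share a mutable set.
    def zero: Acc = Array.fill(numColumns)(mutable.Set.empty[String])
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Array("a", "x"), Array("b", "x")),
      Seq(Array("a", "y"))
    )
    aggregateLocally(partitions, numColumns = 2).foreach(println)
  }
}
```

Whether this is actually faster than the immutable version depends on the data, so it is worth measuring both on a realistic input.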
