flatMap over an RDD while keeping the remaining fields

Mar*_*nau 1 scala apache-spark

I'm working in Spark and I have an RDD of the form:

(x_{11},x_{12}, x_{13}, Array(A_{1},A_{2},A_{3}))
(x_{21},x_{22}, x_{23}, Array(A_{1},A_{2}))
(x_{31},x_{32}, x_{33}, Array(A_{1}))

I want to flatten the Array values while keeping the x values. I know that if I only had the arrays I could call flatMap and get one array element per row, but what I want instead is:

(x_{11},x_{12}, x_{13}, A_{1})
(x_{11},x_{12}, x_{13}, A_{2})
(x_{11},x_{12}, x_{13}, A_{3})
(x_{21},x_{22}, x_{23}, A_{1})
(x_{21},x_{22}, x_{23}, A_{2})
(x_{31},x_{32}, x_{33}, A_{1})

Basically, I want to repeat each row once per item in its array. How can I do this in Spark with Scala?

Tza*_*har 5

You can use flatMap; just make sure the function you pass preserves the "prefix" columns for every value in the list:

val input: RDD[(Int, Int, Int, Seq[String])] = sc.parallelize(Seq(
  (1, 2, 3, Seq("a", "b")),
  (5, 6, 7, Seq("c", "d", "e"))
))

val result: RDD[(Int, Int, Int, String)] = 
  input.flatMap { case (i1, i2, i3, list) => list.map(e => (i1, i2, i3, e)) }

/* result:
   (1,2,3,a)
   (1,2,3,b)
   (5,6,7,c)
   (5,6,7,d)
   (5,6,7,e)
*/
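Since flatMap on an RDD mirrors flatMap on ordinary Scala collections, you can check the flatten-with-prefix logic locally on a plain Seq, without a SparkContext. This is just a sketch using the same illustrative sample values as above; a for-comprehension is an equivalent way to spell it:

```scala
// Local check of the flatten-with-prefix pattern; no Spark required.
// Sample data mirrors the answer's example.
val rows = Seq(
  (1, 2, 3, Seq("a", "b")),
  (5, 6, 7, Seq("c", "d", "e"))
)

// For-comprehension equivalent of flatMap + map: one output row per
// array element, with the prefix columns repeated.
val flattened = for {
  (i1, i2, i3, list) <- rows
  e                  <- list
} yield (i1, i2, i3, e)

flattened.foreach(println)
// (1,2,3,a)
// (1,2,3,b)
// (5,6,7,c)
// (5,6,7,d)
// (5,6,7,e)
```

The compiler desugars this for-comprehension into exactly the flatMap/map chain shown in the answer, so whichever form reads better to you produces the same result.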