Gof*_*tte 2 scala apache-spark
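I have two columns: one of type Integer and one of type linalg.Vector. I can convert the linalg.Vector to an array. Each array has 32 elements. I want to turn each element of the array into its own column. So the input looks like: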
column1 column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
The output should be:
column1_1 column1_2 column1_3 ......... column1_32 column2
3 5 25 ......... 12 3
2 7 15 ......... 10 4
1 10 12 ......... 35 2
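But in my case each array has 32 elements, which is too many for the approach from the question "Convert Array of String column to multiple columns in spark scala".
I have tried several approaches, but none of them worked. What is the right way to do this?
Thanks a lot.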
scala> import org.apache.spark.sql.Column
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]
scala> def getColAtIndex(id:Int): Column = col(s"column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column
scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") // For an array of n elements, replace 2 with n - 1 (31 for 32 elements)
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)
scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
|        3|        5|       25|      3|
|        2|        7|       15|      4|
|        1|       10|       12|      2|
+---------+---------+---------+-------+
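Since the question starts from a linalg.Vector column rather than an array, here is a rough sketch of the same idea with the conversion step included. It assumes the column holds `org.apache.spark.ml.linalg.Vector` values (the question does not say whether it is the `ml` or the `mllib` type), and the names `vecToArray`, `n`, `arr` and `wide` are made up for illustration. The toy data uses 3-element vectors; with the real data `n` would be 32.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-to-columns")
  .getOrCreate()
import spark.implicits._

// Toy data: 3-element vectors stand in for the 32-element ones.
val df = Seq(
  (Vectors.dense(3.0, 5.0, 25.0), 3),
  (Vectors.dense(2.0, 7.0, 15.0), 4),
  (Vectors.dense(1.0, 10.0, 12.0), 2)
).toDF("column1", "column2")

// Assumption: an ml Vector column. The UDF turns it into array<double>.
val vecToArray = udf((v: Vector) => v.toArray)

// Number of elements per vector (32 in the question, 3 in this toy data).
val n = 3

// One Column expression per array index, named column1_1 .. column1_n,
// followed by the untouched column2.
val columns: Seq[Column] =
  (0 until n).map(i => col("arr")(i).as(s"column1_${i + 1}")) :+ col("column2")

val wide = df
  .withColumn("arr", vecToArray(col("column1")))
  .select(columns: _*)

wide.show()
```

Note that the resulting columns are doubles, because `Vector.toArray` returns `Array[Double]`. On Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` should be able to replace the UDF.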