I have a column of lists in a Spark DataFrame:
+-----------------+
|features |
+-----------------+
|[0,45,63,0,0,0,0]|
|[0,0,0,85,0,69,0]|
|[0,89,56,0,0,0,0]|
+-----------------+
How can I convert this into a Spark DataFrame in which each element of the list is its own column? We can assume all the lists have the same size.
For example:
+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
What you are describing is essentially the inverse of a VectorAssembler operation.
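For reference, the forward direction would look roughly like the following minimal sketch (the column names are the ones from your desired output, used here purely for illustration):

from pyspark.ml.feature import VectorAssembler

# Packs the individual columns c1..c7 back into a single vector column:
assembler = VectorAssembler(inputCols=['c' + str(i + 1) for i in range(7)],
                            outputCol='features')
# assembler.transform(out) would reproduce the original single-column layout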
You can achieve it by converting to an intermediate RDD, as follows:
spark.version
# u'2.2.0'
# your data:
df.show(truncate=False)
# +-----------------+
# | features |
# +-----------------+
# |[0,45,63,0,0,0,0]|
# |[0,0,0,85,0,69,0]|
# |[0,89,56,0,0,0,0]|
# +-----------------+
dimensionality = 7
# Unpack each element of the vector into its own field, then name the
# resulting columns c1..c7:
out = (df.rdd
         .map(lambda x: [float(x[0][i]) for i in range(dimensionality)])
         .toDF(schema=['c' + str(i + 1) for i in range(dimensionality)]))
out.show()
# +---+----+----+----+---+----+---+
# | c1| c2| c3| c4| c5| c6| c7|
# +---+----+----+----+---+----+---+
# |0.0|45.0|63.0| 0.0|0.0| 0.0|0.0|
# |0.0| 0.0| 0.0|85.0|0.0|69.0|0.0|
# |0.0|89.0|56.0| 0.0|0.0| 0.0|0.0|
# +---+----+----+----+---+----+---+
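If your features column is an array type rather than an ML Vector, you can also skip the RDD round trip and stay in the DataFrame API with getItem. A minimal sketch under that assumption:

from pyspark.sql.functions import col

# Assumes `features` is an ArrayType column; for ML Vector columns you need
# the RDD route above instead (or vector_to_array in Spark 3+).
out2 = df.select(*[col('features').getItem(i).alias('c' + str(i + 1))
                   for i in range(dimensionality)])
out2.show()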