Splitting a column's contents into rows in PySpark

Question

I have a DataFrame df:

+------+----------+--------------------+
|SiteID| LastRecID|        Col_to_split|
+------+----------+--------------------+
|     2|1056962584|[214, 207, 206, 205]|
|     2|1056967423|          [213, 208]|
|     2|1056870114|     [213, 202, 199]|
|     2|1056876861|[203, 213, 212, 1...|

I want to split the array column into rows, like this:

+----------+-------------+-------------+
|     RecID|        index|        Value|
+----------+-------------+-------------+
|1056962584|            0|          214|
|1056962584|            1|          207|
|1056962584|            2|          206|
|1056962584|            3|          205|
|1056967423|            0|          213|
|1056967423|            1|          208|
|1056870114|            0|          213|
|1056870114|            1|          202|
|1056870114|            2|          199|
|1056876861|            0|          203|
|1056876861|            1|          213|
|1056876861|            2|          212|
|1056876861|            3|          1..|
|1056876861|       etc...|       etc...|

Value holds each element of the list, and index holds that element's position within the list.

How can I do this with PySpark?

Answer 1

Psi*_*dom 6

Since Spark 2.1.0, you can use posexplode to unnest the array column and also output each element's index (using the data from @Herve):

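import pyspark.sql.functions as F
# posexplode emits one row per array element, with two columns:
# the element's zero-based position and the element itself.
# "coltosplit" is the array column name in @Herve's sample data;
# with the question's df it would be "Col_to_split".
df.select(
    F.col("LastRecID").alias("RecID"),
    F.posexplode(F.col("coltosplit")).alias("index", "value")
).show()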
+-----+-----+-----+
|RecID|index|value|
+-----+-----+-----+
|10526|    0|  214|
|10526|    1|  207|
|10526|    2|  206|
|10526|    3|  205|
|10896|    0|  213|
|10896|    1|  208|
+-----+-----+-----+
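
To check the expected result without a Spark session, the per-row behaviour of posexplode can be modelled in plain Python. This is only an illustrative sketch, not Spark code: the helper name posexplode_rows is hypothetical, and the sample data is the first three rows of the question's df (the fourth row is truncated in the question, so it is left out).

```python
from typing import Iterable, Iterator, Sequence, Tuple


def posexplode_rows(
    rows: Iterable[Tuple[int, Sequence[int]]],
) -> Iterator[Tuple[int, int, int]]:
    """Yield (RecID, index, value) for every element of every array.

    Hypothetical helper: a plain-Python model of what F.posexplode
    produces for each input row. It is not part of PySpark.
    """
    for rec_id, values in rows:
        # enumerate supplies the zero-based position of each element
        for index, value in enumerate(values):
            yield rec_id, index, value


# First three rows of the question's df: (LastRecID, Col_to_split)
sample = [
    (1056962584, [214, 207, 206, 205]),
    (1056967423, [213, 208]),
    (1056870114, [213, 202, 199]),
]

result = list(posexplode_rows(sample))
for row in result:
    print(row)
```

In this model, as with posexplode itself, a row whose array is empty produces no output rows. Spark also provides posexplode_outer for cases where such rows should be kept.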
