How do I stack two columns into a single column in PySpark?

Question


Flu*_*uxy 3 python pyspark pyspark-dataframes

I have the following PySpark DataFrame:

id   col1   col2
A    2      3
A    2      4
A    4      6
B    1      2


I want to stack col1 and col2 into a single column, like this:

id   col3
A    2   
A    3
A    4
A    6
B    1
B    2


How can I do this?

df = (
    sc.parallelize([
        ("A", 2, 3), ("A", 2, 4), ("A", 4, 6),
        ("B", 1, 2),
    ]).toDF(["id", "col1", "col2"])
)


Answer 1

Psi*_*dom 5

The simplest approach is to merge col1 and col2 into an array column and then explode it:

df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df.selectExpr('id', 'explode(array(col1, col2))').show()
+---+---+
| id|col|
+---+---+
|  A|  2|
|  A|  3|
|  A|  2|
|  A|  4|
|  A|  4|
|  A|  6|
|  B|  1|
|  B|  2|
+---+---+


If you don't need the duplicate rows, you can drop them by chaining .distinct() (or .dropDuplicates()) before .show().
