How do I stack two columns into a single column in PySpark?

Flu*_*uxy 3 python pyspark pyspark-dataframes

I have the following PySpark DataFrame:

id   col1   col2
A    2      3
A    2      4
A    4      6
B    1      2

I want to stack col1 and col2 to get a single column, like this:

id   col3
A    2   
A    3
A    4
A    6
B    1
B    2

How can I do this?

# Assumes the pyspark shell, where `sc` and an active SparkSession already exist
df = (
    sc.parallelize([
        ("A", 2, 3), ("A", 2, 4), ("A", 4, 6),
        ("B", 1, 2),
    ]).toDF(["id", "col1", "col2"])
)

Psi*_*dom 5

The simplest way is to merge col1 and col2 into an array column and then explode it:

df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df.selectExpr('id', 'explode(array(col1, col2))').show()
+---+---+
| id|col|
+---+---+
|  A|  2|
|  A|  3|
|  A|  2|
|  A|  4|
|  A|  4|
|  A|  6|
|  B|  1|
|  B|  2|
+---+---+

If you don't need the duplicate rows, you can drop them with distinct() or dropDuplicates().