PySpark - Combine multiple columns into a single column spread across rows

Question

I have a PySpark DataFrame with multiple columns, like this:

name    col1    col2    col3
A        1        6       7
B        2        7       6
C        3        8       5
D        4        9       4
E        5        8       3

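I want to create a new DataFrame in which the column names and column values of col1, col2, and col3 are combined into two new columns, say new_col and new_col_val, spread across rows.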

I did the same thing in R with the following code:

df1 <- gather(df,new_col,new_col_val,-name)

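I thought about creating 3 separate DataFrames, each holding one column of the original DataFrame, and then appending them together. However, my data has more than 2500k rows and about 60 columns, so creating that many DataFrames seems like the worst possible approach. Can anyone tell me how to do this in PySpark?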

Answer 1

小智 6

You can use unionAll to turn the columns into rows, and lit to supply each column's name as a value, as shown below:

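from pyspark.sql.functions import lit

# Start with the first value column: keep the id column, store the source
# column's name as a literal in new_col and its values in new_col_val.
df2 = df.select(df.columns[0], lit(df.columns[1]).alias('new_col'),
                df[df.columns[1]].alias('new_col_val'))

# Append one block of rows per remaining column. unionAll matches columns by
# position, so the result keeps the names set in the first select.
for i in df.columns[2:]:
    df2 = df2.unionAll(df.select(df.columns[0], lit(i), df[i]))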

Output:

+----+-------+-----------+
|name|new_col|new_col_val|
+----+-------+-----------+
|   A|   col1|          1|
|   B|   col1|          2|
|   C|   col1|          3|
|   D|   col1|          4|
|   E|   col1|          5|
|   A|   col2|          6|
|   B|   col2|          7|
|   C|   col2|          8|
|   D|   col2|          9|
|   E|   col2|          8|
|   A|   col3|          7|
|   B|   col3|          6|
|   C|   col3|          5|
|   D|   col3|          4|
|   E|   col3|          3|
+----+-------+-----------+

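Since the question mentions about 60 columns, it may be worth avoiding one union per column. A possible alternative, not part of the answer above, is Spark SQL's stack generator, which unpivots all columns in a single select. The sketch below only builds the expression string; the helper names build_stack_expr and melt are made up for illustration, and it assumes column names contain no quotes or backticks. The rows come out grouped by input row (A/col1, A/col2, A/col3, ...) rather than by column as in the output above.

```python
def build_stack_expr(value_cols, name_col="new_col", value_col="new_col_val"):
    """Return a Spark SQL stack() expression that unpivots value_cols."""
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    # Each source column contributes a pair: its name as a string literal,
    # then a backtick-quoted reference to the column itself.
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({name_col}, {value_col})"


def melt(df, id_col, value_cols):
    """Unpivot value_cols of a PySpark DataFrame in a single select.

    df is assumed to be a pyspark.sql.DataFrame; nothing is imported here
    because only its selectExpr method is used.
    """
    return df.selectExpr(f"`{id_col}`", build_stack_expr(value_cols))


# Expression generated for the example DataFrame in the question
print(build_stack_expr(["col1", "col2", "col3"]))
```

With the DataFrame from the question, the call would be `melt(df, "name", df.columns[1:])`. The stacked columns still need matching data types, the same as with the union approach.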

Note: all of the columns being combined must have the same data type.

To check whether the columns share the same data type:

if len(set(map(lambda x: x[-1], df.dtypes[1:]))) != 1:
    raise AssertionError("All columns must be of the same datatype")

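If the check fails, one way forward is to cast the value columns to a common type before combining them. This is a rough sketch rather than anything from the answer: build_cast_exprs is a hypothetical helper, the example dtypes are invented (the question's columns all look like integers), and it assumes the type names reported by df.dtypes can be used directly in a CAST.

```python
def build_cast_exprs(dtypes, id_col, target_type="string"):
    """Build selectExpr strings that cast every non-id column to target_type.

    dtypes is a list of (column_name, type_name) pairs, the shape of df.dtypes.
    """
    exprs = [f"`{id_col}`"]
    for col_name, type_name in dtypes:
        if col_name == id_col:
            continue  # the id column is passed through once, above
        if type_name == target_type:
            exprs.append(f"`{col_name}`")  # already the right type
        else:
            exprs.append(f"CAST(`{col_name}` AS {target_type}) AS `{col_name}`")
    return exprs


# Hypothetical dtypes, standing in for df.dtypes on a mixed-type DataFrame
example_dtypes = [("name", "string"), ("col1", "int"), ("col2", "double")]
print(build_cast_exprs(example_dtypes, "name", target_type="double"))
```

On a real DataFrame this would be applied as `df = df.selectExpr(*build_cast_exprs(df.dtypes, "name", "double"))` before running the union loop.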
