How do I transpose a dataframe in pyspark?

use*_*753 1 pyspark

I can't find any function in pyspark for transposing a dataframe. Say I have:

Cal   Cal2   Cal3
'A'    12     11
'U'    10     9
'O'     5     5
'ER'    6     5

I want to transpose it into:

Cal    'A'   'U'   'O'   'ER'  
Cal2    12    10    5     6    
Cal3    11     9    5     5

In pandas this is as simple as df.T, but I don't know how to do it in pyspark.
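
For what it's worth, collecting to pandas first does work (only when the data fits in driver memory, which is exactly what I want to avoid):

pdf = df.toPandas()            # pulls everything onto the driver
pdf_t = pdf.set_index('Cal').T # plain pandas transpose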

Dav*_*itz 7

Generate a sample dataframe:

df = spark.createDataFrame([('A', 12, 11), ('U', 10, 9), ('O', 5, 5), ('ER', 6, 5)], ['Cal', 'Cal2', 'Cal3'])

Option 1: pyspark.pandas.DataFrame.T

For large dataframes you may need to raise compute.max_rows first, since transposing a frame with more rows than that limit raises an error:

import pyspark.pandas as ps

ps.get_option("compute.max_rows") # 1000
ps.set_option("compute.max_rows", 2000)
(df
 .to_pandas_on_spark()                  # convert to a pandas-on-Spark frame
 .set_index('Cal')                      # Cal values become the new column names
 .T                                     # pandas-style transpose
 .reset_index()                         # former column names come back as a column
 .rename(columns={"index": "Cal"})
 .to_spark()                            # back to a regular Spark dataframe
 .show())

+----+---+---+---+---+
| Cal|  A|  U|  O| ER|
+----+---+---+---+---+
|Cal2| 12| 10|  5|  6|
|Cal3| 11|  9|  5|  5|
+----+---+---+---+---+
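Side note: to_pandas_on_spark has since been deprecated; if I remember correctly, DataFrame.pandas_api (Spark 3.4+) is the drop-in replacement, so the same chain can be written as:

(df
 .pandas_api()
 .set_index('Cal')
 .T
 .reset_index()
 .rename(columns={"index": "Cal"})
 .to_spark()
 .show())
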

Option 2: plain pyspark, the hard way

import pyspark.sql.functions as F

header_col = 'Cal'
cols_minus_header = df.columns
cols_minus_header.remove(header_col)   # ['Cal2', 'Cal3']

df1 = (df
       .groupBy()
       .pivot('Cal')                    # one output column per value of Cal
       .agg(F.first(F.array(cols_minus_header)))                        # each cell: that row's values as an array
       .withColumn(header_col, F.array(*map(F.lit, cols_minus_header))) # keep the old column names as an array too
      )
df1.show(truncate=False)

+--------+------+------+-------+------------+
|       A|    ER|     O|      U|         Cal|
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+
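The trick for the last step: arrays_zip packs the i-th elements of every array column into one struct, and inline explodes the resulting array of structs into rows, one column per struct field. A tiny standalone illustration (with made-up columns x and y):

(spark.createDataFrame([([1, 2], ['a', 'b'])], ['x', 'y'])
 .select(F.arrays_zip('x', 'y').alias('az'))
 .selectExpr('inline(az)')
 .show())

+---+---+
|  x|  y|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
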
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')
df2.show(truncate=False)

+---+---+---+---+----+
|A  |ER |O  |U  |Cal |
+---+---+---+---+----+
|12 |6  |5  |10 |Cal2|
|11 |5  |5  |9  |Cal3|
+---+---+---+---+----+
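One last touch: pivot without an explicit list of values emits the pivoted columns in sorted order (A, ER, O, U above), so to match the layout asked for in the question, reorder them explicitly:

df2.select('Cal', 'A', 'U', 'O', 'ER').show()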