How do I transpose a DataFrame in pyspark?

Question

I can't find any function in pyspark for transposing a DataFrame.

Cal   Cal2   Cal3
'A'    12     11
'U'    10     9
'O'     5     5
'ER'    6     5

Desired output:

Cal    'A'   'U'   'O'   'ER'  
Cal2    12    10    5     6    
Cal3    11     9    5     5

In pandas this is very simple: df.T. But I don't know how to do it in pyspark!
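
To pin down the reshaping being asked for, here is a minimal plain-Python sketch on the sample data. It is not a Spark solution and only suits data that fits in memory. The helper name `transpose` is hypothetical and exists only for this illustration.

```python
# Sample table from the question: the first column, 'Cal', is the header column.
columns = ['Cal', 'Cal2', 'Cal3']
rows = [('A', 12, 11), ('U', 10, 9), ('O', 5, 5), ('ER', 6, 5)]


def transpose(columns, rows):
    """Transpose a small table held as a list of row tuples.

    The first column's values become the new column names, and each
    remaining column becomes one output row labelled with its old name.
    (Hypothetical helper, for illustration only.)
    """
    header_col, *value_cols = columns
    # zip(*rows) turns the list of rows into a list of columns.
    first_col, *other_cols = zip(*rows)
    new_columns = [header_col, *first_col]
    new_rows = [(name, *values) for name, values in zip(value_cols, other_cols)]
    return new_columns, new_rows


new_columns, new_rows = transpose(columns, rows)
print(new_columns)
for row in new_rows:
    print(row)
```

Each old column name ('Cal2', 'Cal3') labels one output row, and each old 'Cal' value names one output column, which is the layout shown in the desired output.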

Answer 1

Dav*_*itz 7

Create a sample DataFrame

df = spark.createDataFrame([('A', 12, 11), ('U', 10, 9), ('O', 5, 5), ('ER', 6, 5)], ['Cal', 'Cal2', 'Cal3'])

Option 1: pyspark.pandas.DataFrame.T

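For large DataFrames you may need to raise compute.max_rows, because transposing is limited to that many input rows (1000 by default).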

import pyspark.pandas as ps

ps.get_option("compute.max_rows")        # default: 1000
ps.set_option("compute.max_rows", 2000)  # raise the limit for larger frames

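# Convert to pandas-on-Spark, transpose with 'Cal' as the index, then convert back.
# On newer Spark versions, df.pandas_api() replaces the deprecated to_pandas_on_spark().
(df
 .to_pandas_on_spark()
 .set_index('Cal')                  # the 'Cal' values become the new column names
 .T
 .reset_index()                     # the old column names return as a column named "index"
 .rename(columns={"index":"Cal"})
 .to_spark()
 .show())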

+----+---+---+---+---+
| Cal|  A|  U|  O| ER|
+----+---+---+---+---+
|Cal2| 12| 10|  5|  6|
|Cal3| 11|  9|  5|  5|
+----+---+---+---+---+

Option 2: plain pyspark, the hard way

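import pyspark.sql.functions as F

header_col = 'Cal'
# df.columns returns a fresh list, so removing an item does not affect df
cols_minus_header = df.columns
cols_minus_header.remove(header_col)

df1 = (df
       .groupBy()
       .pivot(header_col)
       # one array per distinct 'Cal' value; first() assumes the 'Cal' values are unique
       .agg(F.first(F.array(cols_minus_header)))
       # extra array column holding the original column names
       .withColumn(header_col, F.array(*map(F.lit, cols_minus_header)))
      )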

df1.show(truncate = False)

+--------+------+------+-------+------------+
|       A|    ER|     O|      U|         Cal|
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+

# zip the arrays element-wise into an array of structs, then expand each struct into a row
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')

df2.show(truncate = False)

+---+---+---+---+----+
|A  |ER |O  |U  |Cal |
+---+---+---+---+----+
|12 |6  |5  |10 |Cal2|
|11 |5  |5  |9  |Cal3|
+---+---+---+---+----+
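
The two steps of Option 2 can be modelled in plain Python to show why the result comes out transposed. This is only a sketch of the logic on the sample data, not Spark code. It sorts by the header value to mirror the column order in the outputs above, and it keeps the first row seen for each header value, in the spirit of `F.first`.

```python
# Plain-Python model of Option 2; no Spark involved.
header_col = 'Cal'
columns = ['Cal', 'Cal2', 'Cal3']
rows = [('A', 12, 11), ('U', 10, 9), ('O', 5, 5), ('ER', 6, 5)]

header_idx = columns.index(header_col)
cols_minus_header = [c for c in columns if c != header_col]

# Step 1, modelling groupBy().pivot().agg(first(array(...))):
# one array per distinct header value, holding that row's remaining values.
pivoted = {}
for row in sorted(rows, key=lambda r: r[header_idx]):
    values = [v for i, v in enumerate(row) if i != header_idx]
    pivoted.setdefault(row[header_idx], values)  # keep the first row per key

# Still step 1, modelling withColumn(...): an array of the old column names.
pivoted[header_col] = list(cols_minus_header)

# Step 2, modelling arrays_zip + inline: the i-th element of every array
# forms output row i.
out_columns = list(pivoted)
out_rows = list(zip(*pivoted.values()))

print(out_columns)
for row in out_rows:
    print(row)
```

Because every array has one element per original value column, zipping them yields exactly one output row per original column, here 'Cal2' and 'Cal3'.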

Archived:	3 years, 10 months ago
Views:	5186
Last activity:	3 years, 10 months ago