I can't find any function in pyspark that transposes a DataFrame. Given this input:

Cal   Cal2  Cal3
'A'   12    11
'U'   10    9
'O'   5     5
'ER'  6     5

I want this output:

Cal   'A'  'U'  'O'  'ER'
Cal2  12   10   5    6
Cal3  11   9    5    5
In pandas this is simple: df.T. But I don't know how to do it in pyspark!
Create a sample DataFrame:
df = spark.createDataFrame([('A' ,12 ,11),('U' ,10 ,9 ),('O' , 5 ,5 ),('ER', 6 ,5 )], ['Cal','Cal2','Cal3'])
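Option 1: pyspark.pandas.DataFrame.T
For large DataFrames you may need to raise compute.max_rows: .T raises an error when the DataFrame has more rows than this limit (1000 by default).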
import pyspark.pandas as ps
ps.get_option("compute.max_rows") # 1000
ps.set_option("compute.max_rows", 2000)
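If you would rather not change the option for the whole session, the limit can also be raised for a single block with ps.option_context. The helper below is only a sketch: the name transpose_with_row_limit and its parameters are made up for illustration, and it assumes Spark 3.3+ (for pandas_api()) and that the header column holds unique values.

```python
def transpose_with_row_limit(sdf, header_col, max_rows=2000):
    """Transpose a Spark DataFrame via pandas-on-Spark (hypothetical helper)."""
    # Imported lazily so the function can be defined without a Spark install.
    import pyspark.pandas as ps

    # The raised limit applies only inside this block.
    with ps.option_context("compute.max_rows", max_rows):
        return (
            sdf.pandas_api()                         # Spark 3.3+; older: to_pandas_on_spark()
            .set_index(header_col)                   # header values become column names
            .T
            .reset_index()                           # old column names become a column "index"
            .rename(columns={"index": header_col})
            .to_spark()
        )
```

With the sample data, transpose_with_row_limit(df, 'Cal').show() should give the same table as the chained version.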
(df
    .to_pandas_on_spark()  # deprecated since Spark 3.3 in favor of pandas_api()
    .set_index('Cal')
    .T
    .reset_index()
    .rename(columns={"index": "Cal"})
    .to_spark()
    .show())
+----+---+---+---+---+
| Cal|  A|  U|  O| ER|
+----+---+---+---+---+
|Cal2| 12| 10|  5|  6|
|Cal3| 11|  9|  5|  5|
+----+---+---+---+---+
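Option 2: pyspark, the hard way
import pyspark.sql.functions as F

header_col = 'Cal'
# df.columns returns a new list, so removing from it does not affect df
cols_minus_header = df.columns
cols_minus_header.remove(header_col)

df1 = (df
    .groupBy()
    # one output column per distinct value of the header column
    .pivot(header_col)
    # each pivoted cell holds that row's remaining values as an array
    .agg(F.first(F.array(cols_minus_header)))
    # add the old column names as one more array column
    .withColumn(header_col, F.array(*map(F.lit, cols_minus_header)))
)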
df1.show(truncate = False)
+--------+------+------+-------+------------+
|       A|    ER|     O|      U|         Cal|
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+
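To see what the pivot step has built, here is a rough plain-Python model of it (no Spark needed). It is only an illustration of the data shape, not what Spark runs internally. The variable names are made up for the sketch. As with F.first, a header value that appears twice would keep only one of its rows.

```python
# Plain-Python model of the pivot step; no Spark required.
rows = [("A", 12, 11), ("U", 10, 9), ("O", 5, 5), ("ER", 6, 5)]
columns = ["Cal", "Cal2", "Cal3"]
header_col = "Cal"

header_idx = columns.index(header_col)
cols_minus_header = [c for c in columns if c != header_col]

# One entry per header value, holding that row's remaining values as a list
# (the counterpart of F.first(F.array(cols_minus_header)) under pivot).
pivoted = {
    row[header_idx]: [v for i, v in enumerate(row) if i != header_idx]
    for row in rows
}
# The extra entry that carries the old column names.
pivoted[header_col] = cols_minus_header

print(pivoted)
```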
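# arrays_zip merges the arrays into one array of structs; inline expands it into one row per struct
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')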
df2.show(truncate = False)
+---+---+---+---+----+
|A  |ER |O  |U  |Cal |
+---+---+---+---+----+
|12 |6  |5  |10 |Cal2|
|11 |5  |5  |9  |Cal3|
+---+---+---+---+----+
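The arrays_zip + inline step is essentially a zip over the arrays in that single row. A plain-Python sketch of the same idea (no Spark needed, with the df1 row typed in by hand as a dict) might look like this:

```python
# Plain-Python model of arrays_zip + inline; no Spark required.
df1_row = {
    "A": [12, 11],
    "ER": [6, 5],
    "O": [5, 5],
    "U": [10, 9],
    "Cal": ["Cal2", "Cal3"],
}

names = list(df1_row)
# zip(*arrays) pairs up the i-th elements of every array, like arrays_zip;
# emitting one dict per tuple mirrors what inline does with the structs.
transposed = [dict(zip(names, values)) for values in zip(*df1_row.values())]

for r in transposed:
    print(r)
```

Each dict corresponds to one row of df2, so the number of output rows equals the length of the arrays, i.e. the number of non-header columns in the original DataFrame.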