How to calculate percentages over a DataFrame

Jua*_*ith 1 python apache-spark pyspark

I have a scenario that can be simulated with the DataFrame shown below.

Area   Type    NrPeople     
1      House    200
1      Flat     100
2      House    300
2      Flat     400
3      House   1000
4      Flat     250

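I want to compute and return the number of people per area in descending order, but above all I am struggling to calculate each area's percentage of the overall total.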

The result should look like this:

Area   SumPeople      %     
3       1000        44%
2        700        31%
1        300        13%
4        250        11%
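Each percentage in the expected result is the area's sum divided by the grand total (2250), rounded to a whole number. A minimal plain-Python sketch of that arithmetic, with no Spark involved and with illustrative variable names that are not part of the question, could look like this:

```python
from collections import defaultdict

# Same rows as the sample table: (Area, Type, NrPeople)
rows = [
    (1, "House", 200),
    (1, "Flat", 100),
    (2, "House", 300),
    (2, "Flat", 400),
    (3, "House", 1000),
    (4, "Flat", 250),
]

# Sum the people per area
sum_people = defaultdict(int)
for area, _type, nr_people in rows:
    sum_people[area] += nr_people

# Grand total across all areas
total = sum(sum_people.values())

# (Area, SumPeople, rounded percent), largest area first
result = sorted(
    ((area, n, round(100 * n / total)) for area, n in sum_people.items()),
    key=lambda r: r[1],
    reverse=True,
)

for area, n, pct in result:
    print(f"{area}  {n:>5}  {pct}%")
```

Because each value is rounded independently, the rounded percentages need not add up to exactly 100.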

See the code sample below:

HouseDf = spark.createDataFrame([("1", "House", "200"), 
                              ("1", "Flat", "100"), 
                              ("2", "House", "300"), 
                              ("2", "Flat", "400"),
                              ("3", "House", "1000"), 
                              ("4", "Flat", "250")],
                              ["Area", "Type", "NrPeople"])

import pyspark.sql.functions as fn 
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')) 

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()

It fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'

Any ideas on how to do this are welcome!

Pus*_*hkr 5

You need a window function:

import pyspark.sql.functions as fn
from pyspark.sql import Window

# Frame spanning every row, so the windowed sum is the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# Sort last so the displayed order does not depend on the window step
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.withColumn('total', fn.sum(fn.col('SumPeople')).over(window))\
.withColumn('Percent', fn.col('SumPeople')*100/fn.col('total'))\
.drop('total')\
.orderBy('SumPeople', ascending=False)\
.show()

Output:

+----+---------+------------------+
|Area|SumPeople|           Percent|
+----+---------+------------------+
|   3|   1000.0| 44.44444444444444|
|   2|    700.0| 31.11111111111111|
|   1|    300.0|13.333333333333334|
|   4|    250.0| 11.11111111111111|
+----+---------+------------------+