Jua*_*ith 1 python apache-spark pyspark
I have a scenario that can be simulated with the DataFrame shown below.
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
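I want to compute the number of people per area and return it in descending order, but above all I am struggling to compute each area's percentage of the overall total.
The result should look like this: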
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See the code sample below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
                                 ("1", "Flat", "100"),
                                 ("2", "House", "300"),
                                 ("2", "Flat", "400"),
                                 ("3", "House", "1000"),
                                 ("4", "Flat", "250")],
                                ["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
It fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql import Window
# Unpartitioned window spanning every row, so the windowed sum is the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Sum per area, attach the grand total to each row, derive the percentage, sort last
HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .withColumn('total', fn.sum('SumPeople').over(window))\
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.col('total'))\
    .drop('total')\
    .orderBy('SumPeople', ascending=False)\
    .show()
Output:
+----+---------+------------------+
|Area|SumPeople|           Percent|
+----+---------+------------------+
|   3|   1000.0| 44.44444444444444|
|   2|    700.0| 31.11111111111111|
|   1|    300.0|13.333333333333334|
|   4|    250.0| 11.11111111111111|
+----+---------+------------------+
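The percentages can be checked by hand without Spark: each area's sum is divided by the grand total of 2250. The following is a minimal plain-Python sketch of that arithmetic. The names `records`, `sum_people`, and `shares` are illustrative and do not come from the question, and the counts are written as integers rather than strings.

```python
from collections import defaultdict

# Same rows as the question's DataFrame, with NrPeople as integers (an assumption).
records = [
    (1, "House", 200),
    (1, "Flat", 100),
    (2, "House", 300),
    (2, "Flat", 400),
    (3, "House", 1000),
    (4, "Flat", 250),
]

# Sum the people per area.
sum_people = defaultdict(int)
for area, _type, nr_people in records:
    sum_people[area] += nr_people

# Grand total across all areas.
total = sum(sum_people.values())

# (area, sum, rounded percentage), largest sum first.
shares = sorted(
    ((area, people, round(100 * people / total)) for area, people in sum_people.items()),
    key=lambda row: row[1],
    reverse=True,
)

for area, people, pct in shares:
    print(f"{area} {people} {pct}%")
```

Rounding to whole percentages reproduces the table the question asks for; the rounded values need not add up to exactly 100.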