在pyspark数据帧上计算百分比

Bal*_*a13 1 apache-spark pyspark spark-dataframe

我有一个泰坦尼克号数据的pyspark数据框,我在下面粘贴了一个副本。如何添加带有每个存储桶百分比的列?

在此处输入图片说明

谢谢您的帮助!

ece*_*ulm 7

首先是带有输入数据的文字DataFrame:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([
    (1,'female',233),
    (None,'female',314),
    (0,'female',81),
    (1, None, 342), 
    (1, 'male', 109),
    (None, None, 891),
    (0, None, 549),
    (None, 'male', 577),
    (0, None, 468)
    ], 
    ['survived', 'sex', 'count'])
Run Code Online (Sandbox Code Playgroud)

然后,我们使用窗口函数来计算包含完整行集的分区上的计数总和(本质上是总计数):

import pyspark.sql.functions as f
from pyspark.sql.window import Window
df = df.withColumn('percent', f.col('count')/f.sum('count').over(Window.partitionBy()))
df.orderBy('percent', ascending=False).show()

+--------+------+-----+--------------------+
|survived|   sex|count|             percent|
+--------+------+-----+--------------------+
|    null|  null|  891|                0.25|
|    null|  male|  577| 0.16189674523007858|
|       0|  null|  549| 0.15404040404040403|
|       0|  null|  468| 0.13131313131313133|
|       1|  null|  342| 0.09595959595959595|
|    null|female|  314| 0.08810325476992144|
|       1|female|  233|  0.0653759820426487|
|       1|  male|  109| 0.03058361391694725|
|       0|female|   81|0.022727272727272728|
+--------+------+-----+--------------------+
Run Code Online (Sandbox Code Playgroud)

如果将上述步骤分为两步,则很容易看到window函数sum只是将相同的total值添加到每一行

df = df\
  .withColumn('total', f.sum('count').over(Window.partitionBy()))\
  .withColumn('percent', f.col('count')/f.col('total'))
df.show()

+--------+------+-----+--------------------+-----+
|survived|   sex|count|             percent|total|
+--------+------+-----+--------------------+-----+
|       1|female|  233|  0.0653759820426487| 3564|
|    null|female|  314| 0.08810325476992144| 3564|
|       0|female|   81|0.022727272727272728| 3564|
|       1|  null|  342| 0.09595959595959595| 3564|
|       1|  male|  109| 0.03058361391694725| 3564|
|    null|  null|  891|                0.25| 3564|
|       0|  null|  549| 0.15404040404040403| 3564|
|    null|  male|  577| 0.16189674523007858| 3564|
|       0|  null|  468| 0.13131313131313133| 3564|
+--------+------+-----+--------------------+-----+
Run Code Online (Sandbox Code Playgroud)


Rob*_*inL 6

这可能是使用 Spark 的选项,因为它是最“有意”使用的(即,它不涉及显式地将数据收集到驱动程序,并且不会导致生成任何警告:

df = spark.createDataFrame([
    (1,'female',233),
    (None,'female',314),
    (0,'female',81),
    (1, None, 342), 
    (1, 'male', 109),
    (None, None, 891),
    (0, None, 549),
    (None, 'male', 577),
    (0, None, 468)
    ], 
    ['survived', 'sex', 'count'])

df.registerTempTable("df")

sql = """
select *, count/(select sum(count) from df) as percentage
from df
"""

spark.sql(sql).show()
Run Code Online (Sandbox Code Playgroud)

请注意,对于您通常在 Spark 中处理的较大数据集类型,您不希望使用具有window跨越整个数据集的解决方案(例如w = Window.partitionBy())。事实上 Spark 会警告你这一点:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Run Code Online (Sandbox Code Playgroud)

为了说明区别,这里是非窗口版本

sql = """
select *, count/(select sum(count) from df) as percentage
from df
"""
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

请注意,在任何时候,所有 9 行都不会被洗牌到单个执行器。

这是带窗口的版本:

sql = """
select *, count/sum(count) over () as perc
from df
"""

Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

请注意交换(洗牌)步骤中以及发生单分区数据交换的数据量较大:


rog*_*one 1

像下面这样的东西应该可以工作。

df = sc.parallelize([(1,'female',233), (None,'female',314),(0,'female',81),(1, None, 342), (1, 'male', 109)]).toDF().withColumnRenamed("_1","survived").withColumnRenamed("_2","sex").withColumnRenamed("_3","count")
total = df.select("count").agg({"count": "sum"}).collect().pop()['sum(count)']
result = df.withColumn('percent', (df['count']/total) * 100)
result.show()

+--------+------+-----+------------------+
|survived|   sex|count|           percent|
+--------+------+-----+------------------+
|       1|female|  233| 21.59406858202039|
|    null|female|  314|29.101019462465246|
|       0|female|   81| 7.506950880444857|
|       1|  null|  342| 31.69601482854495|
|       1|  male|  109|10.101946246524559|
+--------+------+-----+------------------+
Run Code Online (Sandbox Code Playgroud)