python group-by dataframe pyspark
Suppose I have the following df:
df = spark.createDataFrame([
    ("a", "apple"),
    ("a", "pear"),
    ("b", "pear"),
    ("c", "carrot"),
    ("c", "apple"),
], ["id", "fruit"])
+---+-------+
| id| fruit|
+---+-------+
| a| apple|
| a| pear|
| b| pear|
| c| carrot|
| c| apple|
+---+-------+
I now want to create a boolean flag that is TRUE for every id that has at least one row with "pear" in the fruit column.
The desired output looks like this:
+---+-------+------+
| id| fruit| flag|
+---+-------+------+
| a| apple| True|
| a| pear| True|
| b| pear| True|
| c| carrot| False|
| c| apple| False|
+---+-------+------+
For pandas I found a solution here using groupby().transform(), but I don't understand how to translate it to PySpark.
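For reference, a pandas groupby().transform() solution of the kind referred to might look like this (a sketch, not necessarily the exact linked answer):

```python
import pandas as pd

pdf = pd.DataFrame({
    "id": ["a", "a", "b", "c", "c"],
    "fruit": ["apple", "pear", "pear", "carrot", "apple"],
})

# transform() computes one value per group ("does this id have a pear?")
# and broadcasts it back to every row of that group.
pdf["flag"] = pdf.groupby("id")["fruit"].transform(lambda s: s.eq("pear").any())
```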
Use max as a window function (in Spark SQL, max over a boolean column yields true if any value in the partition is true):
df.selectExpr("*", "max(fruit = 'pear') over (partition by id) as flag").show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
| c|carrot|false|
| c| apple|false|
| b| pear| true|
| a| apple| true|
| a| pear| true|
+---+------+-----+
If you need to check for multiple fruits, you can use the in operator. For example, to check for carrot and apple:
df.selectExpr("*", "max(fruit in ('carrot', 'apple')) over (partition by id) as flag").show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
| c|carrot| true|
| c| apple| true|
| b| pear|false|
| a| apple| true|
| a| pear| true|
+---+------+-----+
If you prefer the Python DataFrame API syntax:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

df.select(
    "*",
    f.max(
        f.col('fruit').isin(['carrot', 'apple'])
    ).over(Window.partitionBy('id')).alias('flag')
).show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
| c|carrot| true|
| c| apple| true|
| b| pear|false|
| a| apple| true|
| a| pear| true|
+---+------+-----+