Pou*_*del 1 python sql pandas apache-spark pyspark
I have a table like this:
+-------+-----+------+------+
|user_id|apple|banana|carrot|
+-------+-----+------+------+
| user_0|    0|     3|     1|
| user_1|    1|     0|     2|
| user_2|    5|     1|     2|
+-------+-----+------+------+
For each fruit, I want to get the list of customers who bought the most items. The desired output is:
        max_user          max_count
apple   [user_2]          5
banana  [user_0]          3
carrot  [user_1, user_2]  2
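How can I get the desired output using PySpark?
How can I get the desired output using PySpark SQL?
I have done some research and searched several pages. So far I have found one answer that comes close, but it requires a transposed table, whereas my table is in the normal layout. I am also trying to learn more than one approach, such as the Spark DataFrame approach and the SQL approach.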
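PySpark solution. Similar to the pandas solution, you first melt the DataFrame using stack, then keep the rows with the maximum count per fruit using rank, group by fruit, and collect the list of users with collect_list.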
from pyspark.sql import functions as F, Window

fruits = df.columns[1:]  # every column except user_id

df2 = df.selectExpr(
    'user_id',
    # Unpivot: emit one (fruit, items) row per fruit column
    'stack(%d, ' % len(fruits) + ', '.join(["'%s', %s" % (c, c) for c in fruits]) + ') as (fruit, items)'
).withColumn(
    'rn',
    # rank() assigns 1 to every row tied for the highest count
    F.rank().over(Window.partitionBy('fruit').orderBy(F.desc('items')))
).filter('rn = 1').groupBy('fruit').agg(
    F.collect_list('user_id').alias('max_user'),
    F.max('items').alias('max_count')
)

df2.show()
+------+----------------+---------+
| fruit| max_user|max_count|
+------+----------------+---------+
| apple| [user_2]| 5|
|banana| [user_0]| 3|
|carrot|[user_1, user_2]| 2|
+------+----------------+---------+
For Spark SQL:
df.createOrReplaceTempView("grocery")

df2 = spark.sql("""
    select
        fruit,
        collect_list(user_id) as max_user,
        max(items) as max_count
    from (
        select *,
            rank() over (partition by fruit order by items desc) as rn
        from (
            select
                user_id,
                stack(3, 'apple', apple, 'banana', banana, 'carrot', carrot) as (fruit, items)
            from grocery
        )
    )
    where rn = 1
    group by fruit
""")

df2.show()
+------+----------------+---------+
| fruit| max_user|max_count|
+------+----------------+---------+
| apple| [user_2]| 5|
|banana| [user_0]| 3|
|carrot|[user_1, user_2]| 2|
+------+----------------+---------+
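For comparison, here is a rough pandas sketch of the same melt, filter, and group steps. The pandas solution mentioned above is not shown in this thread, so this is only one plausible way to write it, not the original. The DataFrame is rebuilt by hand from the sample table in the question, and the names long_df, is_max, and result are my own.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "user_id": ["user_0", "user_1", "user_2"],
    "apple": [0, 1, 5],
    "banana": [3, 0, 1],
    "carrot": [1, 2, 2],
})

# Wide -> long: one row per (user_id, fruit) pair, the role stack() plays in Spark.
long_df = df.melt(id_vars="user_id", var_name="fruit", value_name="items")

# Keep only rows whose count equals the per-fruit maximum, so ties survive
# (the same effect as filtering on rank() = 1).
is_max = long_df["items"] == long_df.groupby("fruit")["items"].transform("max")

# Collect the tied users into a list and report the maximum count per fruit.
result = (
    long_df[is_max]
    .groupby("fruit")
    .agg(max_user=("user_id", list), max_count=("items", "max"))
)
print(result)
```

Comparing against the group maximum is used here instead of a rank column because it is shorter in pandas. Both keep every user tied for first place, which is what the carrot row needs.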