Post by Jes*_*ess

pyspark: aggregate by the most frequent value in a column

  from pyspark.sql.functions import count, sum  # note: this shadows the built-in sum

  # One output row per (city, income_bracket) pair
  aggregated_table = df_input.groupBy('city', 'income_bracket') \
      .agg(
          count('suburb').alias('suburb'),
          sum('population').alias('population'),
          sum('gross_income').alias('gross_income'),
          sum('no_households').alias('no_households'))

I want to group by city and income bracket, but within each city some suburbs fall into different income brackets. How can I group each city under its most frequently occurring income bracket instead?

For example, if income_bracket_10 is the most frequently occurring bracket within a city, the aggregation above should group that whole city under income_bracket_10 rather than produce a separate row for every bracket.
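One possible way to do this (a sketch, not taken from the original post) is to first find the most frequent income_bracket per city using a window ranked by count, then join that modal bracket back onto the data before aggregating. The names most_common_bracket and top_income_bracket are made up for illustration, and ties between equally frequent brackets are broken arbitrarily here.

  from pyspark.sql import functions as F
  from pyspark.sql.window import Window

  # Rank each city's income brackets by how many rows carry them
  w = Window.partitionBy('city').orderBy(F.desc('cnt'))

  most_common_bracket = (
      df_input.groupBy('city', 'income_bracket')
      .agg(F.count('*').alias('cnt'))
      .withColumn('rn', F.row_number().over(w))
      .filter(F.col('rn') == 1)  # keep only the most frequent bracket per city
      .select('city', F.col('income_bracket').alias('top_income_bracket'))
  )

  # Re-label every row with its city's modal bracket, then aggregate as before
  aggregated_table = (
      df_input.join(most_common_bracket, on='city')
      .groupBy('city', 'top_income_bracket')
      .agg(
          F.count('suburb').alias('suburb'),
          F.sum('population').alias('population'),
          F.sum('gross_income').alias('gross_income'),
          F.sum('no_households').alias('no_households'))
  )

Recent Spark versions (3.4+) also ship a built-in mode aggregate function, but the window approach above works on older versions as well.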

Tags: group-by, aggregate, pyspark

2 upvotes · 2 answers · 6,316 views
