How to sort a variable within each group in PySpark?

Asked by sco*_*tle · Tags: pyspark, pyspark-sql

I'm trying to sort the values of `val` by another column, `ts`, within each `id`.

# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# create dummy data
pdf = pd.DataFrame([['2', 2, 'cat'], ['1', 1, 'dog'], ['1', 2, 'cat'], ['2', 3, 'cat'], ['2', 4, 'dog']],
                   columns=['id', 'ts', 'val'])
sdf = spark.createDataFrame(pdf)
sdf.show()

+---+---+---+
| id| ts|val|
+---+---+---+
|  2|  2|cat|
|  1|  1|dog|
|  1|  2|cat|
|  2|  3|cat|
|  2|  4|dog|
+---+---+---+

Answered by sco*_*tle:

You can group by `id`, collect the `(ts, val)` pairs into an array of structs, and sort that array with `sort_array`, which orders the structs by their first field, `ts`:

sorted_sdf = sdf.groupBy('id').agg(
    F.sort_array(F.collect_list(F.struct(F.col('ts'), F.col('val'))), asc=True)
     .alias('sorted_col')
)

sorted_sdf.show()

+---+--------------------+
| id|          sorted_col|
+---+--------------------+
|  1|  [[1,dog], [2,cat]]|
|  2|[[2,cat], [3,cat]...|
+---+--------------------+

Then we can explode this array, one row per struct:

explode_sdf = sorted_sdf.select('id', F.explode(F.col('sorted_col')).alias('sorted_explode'))

explode_sdf.show()

+---+--------------+
| id|sorted_explode|
+---+--------------+
|  1|       [1,dog]|
|  1|       [2,cat]|
|  2|       [2,cat]|
|  2|       [3,cat]|
|  2|       [4,dog]|
+---+--------------+

Finally, split the `sorted_explode` struct back into two separate columns:

detupled_sdf = explode_sdf.select('id', 'sorted_explode.*')

detupled_sdf.show()

+---+---+---+
| id| ts|val|
+---+---+---+
|  1|  1|dog|
|  1|  2|cat|
|  2|  2|cat|
|  2|  3|cat|
|  2|  4|dog|
+---+---+---+

Now our original dataframe is sorted by `ts` within each `id`.
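
For reference, the three steps above can also be chained into a single expression. This is just a sketch combining the same calls, assuming the `spark` session and imports from the question; the variable name `resorted_sdf` is only illustrative:

# collect (ts, val) per id, sort the array by ts, explode, and flatten back to columns
resorted_sdf = (
    sdf.groupBy('id')
       .agg(F.sort_array(F.collect_list(F.struct('ts', 'val'))).alias('sorted_col'))
       .select('id', F.explode('sorted_col').alias('sorted_explode'))
       .select('id', 'sorted_explode.*')
)

resorted_sdf.show()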