Spark Dataframe - 如何根据 ID 和日期仅保留每个组的最新记录?

Joe*_*Joe 3 date dataframe apache-spark pyspark

我有一个数据框:

DF:

1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
Run Code Online (Sandbox Code Playgroud)

如何只保留每个组的最新记录?(上面有 3 个组 (1,2,3))。

结果应该是:

1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
Run Code Online (Sandbox Code Playgroud)

还试图使其高效(例如,在具有 1 亿条记录的中等集群上在几分钟内完成),因此应该以最有效和正确的方式进行排序/排序(如果需要)。

Rav*_*avi 6

您必须使用窗口函数。

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window

您必须按组和 OrderBy 时间对窗口进行分区,在 pyspark 脚本下方完成工作

from pyspark.sql.functions import *
from pyspark.sql.window import Window

schema = "Group int,time timestamp "

df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')


w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank',dense_rank().over(w))

df.filter(df.Rank == 1).drop(df.Rank).show()


+-----+-------------------+
|Group|               time|
+-----+-------------------+
|    1|2016-11-18 14:47:05|
|    3|2016-10-12 17:24:25|
|    2|2016-10-12 22:24:25|
+-----+-------------------+ ```





Run Code Online (Sandbox Code Playgroud)

  • 你应该使用row_number()而不是dense_rank(),因为dense_rank()为“并列”行提供相同的排名 (7认同)

小智 0

对于这样的情况,您可以使用此处描述的窗口函数:

scala> val in = Seq((1,"2016-10-12 18:24:25"),
     | (1,"2016-11-18 14:47:05"),
     | (2,"2016-10-12 21:24:25"),
     | (2,"2016-10-12 20:24:25"),
     | (2,"2016-10-12 22:24:25"),
     | (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val win = Window.partitionBy("id").orderBy("ts desc")
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7

scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id|                 ts|rank|
+---+-------------------+----+
|  1|2016-11-18 14:47:05|   1|
|  3|2016-10-12 17:24:25|   1|
|  2|2016-10-12 22:24:25|   1|
+---+-------------------+----+
Run Code Online (Sandbox Code Playgroud)