Spark DataFrame: keep the latest record

jul*_*era 3 python apache-spark

I have a DataFrame similar to this:

id  | date       | value
--- | ---------- | ------
1   | 2016-01-07 | 13.90
1   | 2016-01-16 | 14.50
2   | 2016-01-09 | 10.50
2   | 2016-01-28 | 5.50
3   | 2016-01-05 | 1.50

I am trying to keep the latest value for each id, like this:

id  | date       | value
--- | ---------- | ------
1   | 2016-01-16 | 14.50
2   | 2016-01-28 | 5.50
3   | 2016-01-05 | 1.50

I tried ordering by date descending and then dropping duplicates:

new_df = df.orderBy(df.date.desc()).dropDuplicates(['id'])   

My question is: will dropDuplicates() keep the first duplicate it finds? Is there a better way to accomplish what I want? By the way, I'm using Python.

Thanks.

gen*_*nch 8

If you have items with the same date, then you will get duplicates with dense_rank. You should use row_number:

from pyspark.sql.window import Window
from datetime import date
import pyspark.sql.functions as F

rdd = spark.sparkContext.parallelize([
    [1, date(2016, 1, 7), 13.90],
    [1, date(2016, 1, 7), 10.0 ],  # added to show the effect of a duplicate date
    [1, date(2016, 1, 16), 14.50],
    [2, date(2016, 1, 9), 10.50],
    [2, date(2016, 1, 28), 5.50],
    [3, date(2016, 1, 5), 1.50]]
)
df = rdd.toDF(['id','date','price'])
df.show(10)

+---+----------+-----+
| id|      date|price|
+---+----------+-----+
|  1|2016-01-07| 13.9|
|  1|2016-01-07| 10.0|
|  1|2016-01-16| 14.5|
|  2|2016-01-09| 10.5|
|  2|2016-01-28|  5.5|
|  3|2016-01-05|  1.5|
+---+----------+-----+


# row_number
df.withColumn(
    "row_number",
    F.row_number().over(Window.partitionBy(df.id).orderBy(df.date))
).filter(F.col("row_number") == 1).show()
+---+----------+-----+----------+
| id|      date|price|row_number|
+---+----------+-----+----------+
|  3|2016-01-05|  1.5|         1|
|  1|2016-01-07| 13.9|         1|
|  2|2016-01-09| 10.5|         1|
+---+----------+-----+----------+

# dense_rank
df.withColumn(
    "dense_rank",
    F.dense_rank().over(Window.partitionBy(df.id).orderBy(df.date))
).filter(F.col("dense_rank") == 1).show()

+---+----------+-----+----------+
| id|      date|price|dense_rank|
+---+----------+-----+----------+
|  3|2016-01-05|  1.5|         1|
|  1|2016-01-07| 13.9|         1|
|  1|2016-01-07| 10.0|         1|
|  2|2016-01-09| 10.5|         1|
+---+----------+-----+----------+

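The examples above order the window by date ascending, so they keep the earliest row per id. For the original question (the latest value per id), the same row_number pattern works with a descending sort; a minimal sketch, assuming the df and imports defined above:

# Latest row per id: order the window by date descending, keep row 1,
# then drop the helper column.
latest_df = (
    df.withColumn(
        "row_number",
        F.row_number().over(Window.partitionBy(df.id).orderBy(df.date.desc()))
    )
    .filter(F.col("row_number") == 1)
    .drop("row_number")
)
latest_df.show()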


Fok*_*ong 7

The suggested window operator solves this problem nicely:

from datetime import date

rdd = sc.parallelize([
    [1, date(2016, 1, 7), 13.90],
    [1, date(2016, 1, 16), 14.50],
    [2, date(2016, 1, 9), 10.50],
    [2, date(2016, 1, 28), 5.50],
    [3, date(2016, 1, 5), 1.50]
])

df = rdd.toDF(['id','date','price'])
df.show()

+---+----------+-----+
| id|      date|price|
+---+----------+-----+
|  1|2016-01-07| 13.9|
|  1|2016-01-16| 14.5|
|  2|2016-01-09| 10.5|
|  2|2016-01-28|  5.5|
|  3|2016-01-05|  1.5|
+---+----------+-----+

df.registerTempTable("entries")  # replaced by createOrReplaceTempView in Spark 2.0

output = sqlContext.sql('''
    SELECT 
        *
    FROM (
        SELECT 
            *,
            dense_rank() OVER (PARTITION BY id ORDER BY date DESC) AS rank
        FROM entries
    ) vo WHERE rank = 1
''')

output.show()

+---+----------+-----+----+
| id|      date|price|rank|
+---+----------+-----+----+
|  1|2016-01-16| 14.5|   1|
|  2|2016-01-28|  5.5|   1|
|  3|2016-01-05|  1.5|   1|
+---+----------+-----+----+
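
On Spark 2.x the same idea can be written against a SparkSession with createOrReplaceTempView, as the comment above notes; a minimal sketch, assuming a SparkSession named spark and the df defined above, using ROW_NUMBER so that ties on the same date do not produce extra rows:

# Spark 2.x style: register a temp view and query it via the SparkSession.
df.createOrReplaceTempView("entries")

output = spark.sql("""
    SELECT id, date, price
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
        FROM entries
    ) ranked
    WHERE rn = 1
""")
output.show()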


iad*_*7ya 6

You can use row_number to get the record with the latest date:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

new_df = (
    df.withColumn(
        "row_number",
        F.row_number().over(Window.partitionBy(df.id).orderBy(df.date.desc()))
    )
    .filter(F.col("row_number") == 1)
    .drop("row_number")
)
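
Applied to the sample DataFrame from the question (columns id, date and value), new_df should match the desired output; the row order of show() is not guaranteed:

new_df.show()

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
|  1|2016-01-16| 14.5|
|  2|2016-01-28|  5.5|
|  3|2016-01-05|  1.5|
+---+----------+-----+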