Here is my attempt.
First, I determine which Date_2 meets your condition; then I join the second DataFrame again to pick up Value_2.
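For reference, here is a minimal reconstruction of the inputs, reverse-engineered from the result tables shown below; your actual df1/df2 may contain more rows, and an existing SparkSession named spark is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed sample data, inferred from the outputs further down
df1 = spark.createDataFrame(
    [('A1', '1/15/20', 5), ('A2', '1/20/20', 10),
     ('A3', '2/21/20', 12), ('A1', '1/21/20', 6)],
    ['ID', 'Date', 'Value'])
df2 = spark.createDataFrame(
    [('A1', '1/12/20', 5), ('A2', '1/11/20', 12),
     ('A3', '1/31/20', 14), ('A1', '1/16/20', 3)],
    ['ID', 'Date_2', 'Value_2'])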
from pyspark.sql.functions import monotonically_increasing_id, unix_timestamp, max

# Tag each df1 row with a stable id, join df2 on ID, keep only Date_2 values
# on or before Date, then take the latest such Date_2 per original row.
# Note: max here is pyspark.sql.functions.max, which shadows the Python builtin.
df3 = df1.withColumn('newId', monotonically_increasing_id()) \
    .join(df2, 'ID', 'left') \
    .where(unix_timestamp('Date', 'M/dd/yy') >= unix_timestamp('Date_2', 'M/dd/yy')) \
    .groupBy(*df1.columns, 'newId') \
    .agg(max('Date_2').alias('Date_2'))

df3.orderBy('newId').show(20, False)
+---+-------+-----+-----+-------+
|ID |Date |Value|newId|Date_2 |
+---+-------+-----+-----+-------+
|A1 |1/15/20|5 |0 |1/12/20|
|A2 |1/20/20|10 |1 |1/11/20|
|A3 |2/21/20|12 |2 |1/31/20|
|A1 |1/21/20|6 |3 |1/16/20|
+---+-------+-----+-----+-------+
# Join back to df2 on (ID, Date_2) to pull in Value_2, then drop the helper columns.
df3.join(df2, ['ID', 'Date_2'], 'left') \
    .orderBy('newId') \
    .drop('Date_2', 'newId') \
    .show(20, False)
+---+-------+-----+-------+
|ID |Date |Value|Value_2|
+---+-------+-----+-------+
|A1 |1/15/20|5 |5 |
|A2 |1/20/20|10 |12 |
|A3 |2/21/20|12 |14 |
|A1 |1/21/20|6 |3 |
+---+-------+-----+-------+
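One caveat, offered as a side note rather than part of the answer above: max('Date_2') compares the strings lexicographically, which happens to work for this sample but can mis-order dates once months reach two digits (e.g. '10/01/20' sorts before '2/01/20' as strings). A sketch of a variant that aggregates the parsed date instead, then formats it back to the original string form so the second join still matches df2, could look like this:

from pyspark.sql import functions as F

df3 = df1.withColumn('newId', F.monotonically_increasing_id()) \
    .join(df2, 'ID', 'left') \
    .where(F.to_date('Date', 'M/dd/yy') >= F.to_date('Date_2', 'M/dd/yy')) \
    .groupBy(*df1.columns, 'newId') \
    .agg(F.date_format(F.max(F.to_date('Date_2', 'M/dd/yy')), 'M/dd/yy')
          .alias('Date_2'))  # latest Date_2 on or before Date, kept as a string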