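I have a dataset with three columns: id, day, and value. I need to add rows with a value of zero for every combination of id and day that is not already present.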
# Simplified version of my data frame
# (assumes an active SparkSession named `spark`, as in the pyspark shell)
data = [("1", "2020-04-01", 5),
        ("2", "2020-04-01", 5),
        ("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
What I came up with is:
# Create all combinations of id and day
ids = df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df, filling missing values with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter')\
    .na.fill(value=0, subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id|       day|value|
+---+----------+-----+
|  1|2020-04-01| …
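
Since the output above is cut off, here is a minimal sketch of the same cross-join-and-fill idea in plain Python, without Spark, to show which rows the approach should produce for the sample data. The helper name `fill_missing_combinations` is hypothetical and only for illustration, and the sketch assumes each (id, day) pair appears at most once in the input.

```python
from itertools import product

# Same sample rows as the simplified data frame above
data = [("1", "2020-04-01", 5),
        ("2", "2020-04-01", 5),
        ("3", "2020-04-02", 4)]


def fill_missing_combinations(rows, fill_value=0):
    """Return one row per (id, day) pair, using fill_value where no row exists.

    Hypothetical helper for illustration; assumes (id, day) pairs are unique.
    """
    # Look up existing values by their (id, day) key
    known = {(id_, day): value for id_, day, value in rows}
    # Distinct ids and days, sorted so the result order is deterministic
    ids = sorted({id_ for id_, _, _ in rows})
    days = sorted({day for _, day, _ in rows})
    # Cartesian product plays the role of crossJoin; .get() plays the role of na.fill
    return [(id_, day, known.get((id_, day), fill_value))
            for id_, day in product(ids, days)]


full_rows = fill_missing_combinations(data)
for row in full_rows:
    print(row)
```

For the three sample rows this gives six rows (3 ids × 2 days), with a value of 0 for the three pairs that were missing from the input.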