bar*_*412 5 python rust-polars python-polars
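As described here, Polars introduced an automatic caching mechanism for LazyFrames that appear multiple times in the logical plan, so users do not have to apply caching themselves.
However, while trying out this new mechanism, I ran into a case where the automatic caching is not applied optimally:
Without explicit caching: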
import polars as pl
df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()
df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
df4]).explain())
We get the logical plan:
UNION
PLAN 0:
INNER JOIN:
LEFT PLAN ON: [col("id")]
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
CACHE[id: 8cee8e3a6f454983, count: 1]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
END INNER JOIN
PLAN 1:
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
PLAN 2:
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
CACHE[id: 8cee8e3a6f454983, count: 1]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
END UNION
With explicit caching:
import polars as pl
df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()
df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
df4]).explain())
We get the logical plan:
UNION
PLAN 0:
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: 290661b0780, count: 18446744073709551615]
FAST_PROJECT: [id]
INNER JOIN:
LEFT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
END INNER JOIN
PLAN 1:
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
PLAN 2:
CACHE[id: 290661b0780, count: 18446744073709551615]
FAST_PROJECT: [id]
INNER JOIN:
LEFT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
END UNION
As you can see, with explicit caching we get a more optimal plan, because the join of df1 and df2 is performed only once.
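To make the comparison between the two plans easier to read, here is a minimal sketch that pulls the CACHE nodes out of the text printed by explain(). The helper name cache_summary is hypothetical, it only parses the printed plan and does not call Polars, and the CACHE lines are copied from the two plans above.

```python
import re
from collections import Counter

# Matches lines such as "CACHE[id: a4bcf9591fefc837, count: 3]".
CACHE_RE = re.compile(r"CACHE\[id: (\w+), count: (\d+)\]")


def cache_summary(plan: str) -> dict:
    """Map each cache id to (count field, number of appearances in the plan text).

    Hypothetical helper: it works on the string returned by explain(),
    not on any Polars internals.
    """
    appearances = Counter()
    count_field = {}
    for cache_id, count in CACHE_RE.findall(plan):
        appearances[cache_id] += 1
        count_field[cache_id] = int(count)
    return {cid: (count_field[cid], appearances[cid]) for cid in appearances}


# CACHE lines copied from the plan without explicit caching, in order.
auto_plan = """
CACHE[id: a4bcf9591fefc837, count: 3]
CACHE[id: 8cee8e3a6f454983, count: 1]
CACHE[id: a4bcf9591fefc837, count: 3]
CACHE[id: a4bcf9591fefc837, count: 3]
CACHE[id: 8cee8e3a6f454983, count: 1]
"""

# CACHE lines copied from the plan with explicit caching, in order.
explicit_plan = """
CACHE[id: 290661b0780, count: 18446744073709551615]
CACHE[id: 290661b0780, count: 18446744073709551615]
"""

print(cache_summary(auto_plan))
print(cache_summary(explicit_plan))
```

In the first plan every CACHE node sits directly above a DF scan, so only the input frames are cached and the join of df1 and df2 is printed in both PLAN 0 and PLAN 2. In the second plan a single cache id wraps the join itself. Its count field equals 2**64 - 1, which looks like a sentinel value for an explicit cache, but that is my guess and not something the plan output confirms.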
Why doesn't the Polars automatic caching mechanism detect that the join is reused and apply caching on its own? What am I missing?
Thanks.