How does the Polars automatic caching mechanism work on LazyFrames?

bar*_*412 5 python rust-polars python-polars

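As stated here, Polars introduced an automatic caching mechanism for LazyFrames that occur multiple times in the logical plan, so the user does not have to apply caching explicitly.
However, while trying out this new mechanism, I ran into a case where the automatic caching is not applied optimally: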

Without explicit caching:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get this logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      INNER JOIN:
      LEFT PLAN ON: [col("id")]
        CACHE[id: a4bcf9591fefc837, count: 3]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      RIGHT PLAN ON: [col("id")]
        CACHE[id: 8cee8e3a6f454983, count: 1]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    CACHE[id: a4bcf9591fefc837, count: 3]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
  PLAN 2:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: a4bcf9591fefc837, count: 3]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    RIGHT PLAN ON: [col("id")]
      CACHE[id: 8cee8e3a6f454983, count: 1]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    END INNER JOIN
END UNION

With explicit caching:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get this logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: 290661b0780, count: 18446744073709551615]
        FAST_PROJECT: [id]
          INNER JOIN:
          LEFT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          RIGHT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
  PLAN 2:
    CACHE[id: 290661b0780, count: 18446744073709551615]
      FAST_PROJECT: [id]
        INNER JOIN:
        LEFT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        RIGHT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        END INNER JOIN
END UNION

As you can see, with explicit caching we get a more optimized plan, because the join of df1 and df2 is performed only once.
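The difference between the two plans can also be checked mechanically from the explain() text. The following is a minimal sketch, assuming the plan layout printed above (indentation marks nesting, and a cache node is printed as CACHE[id: ..., count: ...]). The helper name caches_covering_joins is hypothetical and not part of Polars, and the two plan strings are shortened excerpts of the outputs shown above.

```python
import re

# Matches a cache node as printed in the plans above (layout is an assumption).
CACHE_RE = re.compile(r"CACHE\[id: ([0-9a-f]+), count: \d+\]")


def caches_covering_joins(plan: str) -> set[str]:
    """Return the ids of CACHE nodes whose subtree contains a JOIN node."""
    open_caches = []  # stack of (indent, cache_id) for enclosing cache nodes
    covering = set()
    for line in plan.splitlines():
        stripped = line.lstrip()
        if not stripped:
            continue
        indent = len(line) - len(stripped)
        # A line at the same or a shallower indent ends the cache subtree.
        while open_caches and open_caches[-1][0] >= indent:
            open_caches.pop()
        match = CACHE_RE.match(stripped)
        if match:
            open_caches.append((indent, match.group(1)))
        elif stripped.endswith("JOIN:"):
            # "END INNER JOIN" has no trailing colon, so it is not counted.
            covering.update(cache_id for _, cache_id in open_caches)
    return covering


# Excerpt of the plan without explicit caching: caches wrap only the scans.
AUTO_PLAN = """
INNER JOIN:
LEFT PLAN ON: [col("id")]
  CACHE[id: a4bcf9591fefc837, count: 3]
    DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
  CACHE[id: 8cee8e3a6f454983, count: 1]
    DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
"""

# Excerpt of the plan with explicit caching: the cache wraps the join itself.
EXPLICIT_PLAN = """
CACHE[id: 290661b0780, count: 18446744073709551615]
  FAST_PROJECT: [id]
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    END INNER JOIN
"""

print(caches_covering_joins(AUTO_PLAN))
print(caches_covering_joins(EXPLICIT_PLAN))
```

For the automatic plan the result is empty, since the caches sit below the joins. For the explicit plan the single cache id is returned, since the join sits below the cache.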

Why doesn't the Polars automatic caching mechanism detect that the join is reused and apply the cache on its own? What am I missing?

Thanks.