How can I perform a conditional join more efficiently in Polars?

leb*_*gue 3 python python-polars

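I have a fairly large dataframe, and joining it with itself takes a while. However, I want to join on some extra conditions, which could make the resulting dataframe much smaller. My question is: how can I take advantage of those conditions to make the conditional join faster than a plain full join?

The code below illustrates the problem: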

import time
import numpy as np
import polars as pl

# example dataframe
rng = np.random.default_rng(1)

nrows = 3_000_000
df = pl.DataFrame(
    dict(
        day=rng.integers(1, 300, nrows),
        id=rng.integers(1, 5_000, nrows),
        id2=rng.integers(1, 5, nrows),
        value=rng.normal(0, 1, nrows),
    )
)

# joining df with itself takes around 10-15 seconds on a machine with 32 cores.
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left")
time.perf_counter() - start

# joining df with itself and then filtering on the extra conditions takes about the same time (10-15 seconds).
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left").filter(
    (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
)
time.perf_counter() - start

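So, as stated above, my question is how to take advantage of the conditions during the join to make the "conditional join" faster.

It should be faster, because the dataframe that remains after applying the conditions has about 10 times fewer rows than the full join without any conditions.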

小智 5

Using the lazy API with streaming=True is faster (about 7.1 seconds versus 11.1 seconds in the run below):

In [5]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect()
   ...: time.perf_counter() - start
Out[5]: 11.083821532000002

In [6]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect(streaming=True)
   ...: time.perf_counter() - start
Out[6]: 7.110704054997768