How to perform a conditional join more efficiently in Polars?

Question

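I have a reasonably large dataframe. Joining it with itself takes some time. However, I only need the rows that satisfy some extra conditions, which makes the resulting dataframe much smaller. My question is: how can I take advantage of these conditions to make the conditional join faster than the plain full join?

The code below illustrates the problem: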

import time
import numpy as np
import polars as pl

# example dataframe
rng = np.random.default_rng(1)

nrows = 3_000_000
df = pl.DataFrame(
    dict(
        day=rng.integers(1, 300, nrows),
        id=rng.integers(1, 5_000, nrows),
        id2=rng.integers(1, 5, nrows),
        value=rng.normal(0, 1, nrows),
    )
)

# joining df with itself takes around 10-15 seconds on a machine with 32 cores.
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left")
time.perf_counter() - start

# joining df with itself with extra conditions - the implementation below takes about the same time (10-15 seconds).
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left").filter(
    (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
)
time.perf_counter() - start

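So, as stated above, my question is: how can I take advantage of the conditions during the join to make the "conditional join" faster?

It should be faster, because the dataframe that remains after applying the conditions has about 10 times fewer rows than the full join without any conditions.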
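
To make the size argument concrete, here is a minimal pure-Python sketch (not Polars code) that counts, for a scaled-down stand-in for the data, how many rows the plain join on (id, id2) produces and how many of them satisfy the day conditions. The helper name count_pairs, the row count, and the seed are illustrative assumptions; the exact ratio depends on the data. It also shows that, once each group is sorted by day, the matching rows can be located with a binary search instead of enumerating every pair.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def count_pairs(rows, window=30):
    """Return (equi_join_rows, conditional_rows) for a self-join on (id, id2).

    rows is an iterable of (day, id, id2) tuples. The conditional count keeps
    only pairs with day < day_right <= day + window.
    """
    days_by_key = defaultdict(list)
    for day, id_, id2 in rows:
        days_by_key[(id_, id2)].append(day)

    full = 0
    conditional = 0
    for days in days_by_key.values():
        days.sort()
        # In the plain join, every row of a group pairs with every row of that group.
        full += len(days) * len(days)
        for day in days:
            # Binary search for the slice of days in (day, day + window].
            lo = bisect_right(days, day)
            hi = bisect_right(days, day + window)
            conditional += hi - lo
    return full, conditional


random.seed(1)
# Scaled-down stand-in for the example frame (assumed size, not the original 3M rows).
rows = [
    (random.randint(1, 299), random.randint(1, 49), random.randint(1, 4))
    for _ in range(30_000)
]
full, conditional = count_pairs(rows)
print(full, conditional, round(full / conditional, 1))
```

The group sizes here (roughly 150 rows per (id, id2) pair) are chosen to resemble the original example, so the printed ratio gives a rough feel for how much smaller the conditional result is.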

Answer 1

Using the lazy API together with streaming=True is faster:

In [5]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect()
   ...: time.perf_counter() - start
Out[5]: 11.083821532000002

In [6]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect(streaming=True)
   ...: time.perf_counter() - start
Out[6]: 7.110704054997768
