leb*_*gue 3 python python-polars
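I have a fairly large dataframe, and joining it with itself takes some time. However, I want to join under some conditions, which should make the resulting dataframe much smaller. My question is: how can I take advantage of these conditions to make the conditional join faster than a plain full join?
The code below illustrates this: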
import time
import numpy as np
import polars as pl
# example dataframe
rng = np.random.default_rng(1)
nrows = 3_000_000
df = pl.DataFrame(
    dict(
        day=rng.integers(1, 300, nrows),
        id=rng.integers(1, 5_000, nrows),
        id2=rng.integers(1, 5, nrows),
        value=rng.normal(0, 1, nrows),
    )
)
# joining df with itself takes around 10-15 seconds on a machine with 32 cores.
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left")
time.perf_counter() - start
# joining df with itself with extra conditions: the implementation below takes about the same time (10-15 seconds).
start = time.perf_counter()
df.join(df, on=["id", "id2"], how="left").filter(
    (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
)
time.perf_counter() - start
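So, as stated above, my question is: how can I take advantage of the conditions during the join to make the "conditional join" faster?
It should be faster, because the dataframe produced with the conditions has about 10 times fewer rows than the full join without any conditions.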
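To make the intended result concrete, here is a minimal pure-Python sketch of the semantics of the conditional self-join on a tiny hand-made table. The data, the function name `conditional_self_join`, and the `max_gap` parameter are made up for illustration; this is a reference for what rows should come out, not a fast implementation.

```python
# Tiny illustrative table: (day, id, id2, value). Values are invented.
rows = [
    (1, 1, 1, 0.5),
    (10, 1, 1, -0.2),
    (35, 1, 1, 1.3),
    (5, 1, 2, 0.7),
    (20, 2, 1, 0.1),
]


def conditional_self_join(rows, max_gap=30):
    """Pair rows that share (id, id2) and satisfy day < day_right <= day + max_gap."""
    out = []
    for day, id_, id2, value in rows:
        for day_r, id_r, id2_r, value_r in rows:
            same_key = (id_, id2) == (id_r, id2_r)
            in_window = day < day_r and day_r - day <= max_gap
            if same_key and in_window:
                out.append((day, id_, id2, value, day_r, value_r))
    return out


pairs = conditional_self_join(rows)

# Row count of the plain join on (id, id2) alone, for comparison.
full_join_rows = sum(1 for a in rows for b in rows if a[1:3] == b[1:3])

print(len(pairs), "rows with conditions vs", full_join_rows, "rows without")
```

Even on five rows, the conditions keep only a small part of the plain join, which is the reduction the question hopes to exploit.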
小智 5
Use the lazy API and collect with streaming=True; in the timings below it is faster (about 7.1 s versus 11.1 s):
In [5]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect()
   ...: time.perf_counter() - start
Out[5]: 11.083821532000002
In [6]: start = time.perf_counter()
   ...: df.lazy().join(df.lazy(), on=["id", "id2"], how="left").filter(
   ...:     (pl.col("day") < pl.col("day_right")) & (pl.col("day_right") - pl.col("day") <= 30)
   ...: ).collect(streaming=True)
   ...: time.perf_counter() - start
Out[6]: 7.110704054997768