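I have the following code, which iterates over a dataframe and updates chunks of one column based on two other columns. The current solution uses `loc` inside `itertuples`.
Is it possible to make the code faster without resorting to manual parallelization or splitting the dataframe?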
    import numpy as np
    import pandas as pd

    n_rows = 10000
    ix_ = pd.date_range(start="2020-01-01 00:00", freq="min", periods=n_rows)
    offsets_ = pd.to_timedelta(np.random.randint(0, 60, size=n_rows), unit="min")
    df = pd.DataFrame(
        ix_ + pd.to_timedelta(offsets_, unit="min"), index=ix_, columns=["t_end"]
    )
    df["active"] = 0
    # For each row, increment "active" on every row from its own label up to t_end
    for row in df.itertuples():
        df.loc[row.Index : row.t_end, "active"] += 1
Here is an option that runs in about 2 ms (so roughly 3000x faster than the original, and roughly 1000x faster than the other solutions proposed here).
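We can observe that we are really just adding up intervals defined by start and end timestamps, so the activity can be computed in vectorized form: put +1 at the start of each interval and -1 just after its end, then take the cumulative sum.
Visually, this happens in three steps: place the markers (+1 "start", -1 "end"), sum the markers at each timestamp to get the net started/ended count, then take the cumulative sum to get the activity.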
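As a minimal sketch of the marker idea on its own, here is a toy NumPy version. The three intervals and the 10-slot timeline are hypothetical illustration data (not the dataframe from the question), and the intervals are given as inclusive integer positions:

```python
import numpy as np

# Hypothetical toy data: three intervals on a 10-slot timeline,
# given as inclusive (start, end) positions.
starts = np.array([0, 1, 2])
ends = np.array([3, 6, 5])
n = 10

# One extra slot so an interval ending on the last position still has
# somewhere to put its -1 marker.
markers = np.zeros(n + 1, dtype=int)
np.add.at(markers, starts, 1)      # +1 where an interval starts
np.add.at(markers, ends + 1, -1)   # -1 one slot after it ends

# The running total is the number of intervals open at each slot.
active = np.cumsum(markers)[:n]

# Brute-force reference: increment every slot covered by each interval.
expected = np.zeros(n, dtype=int)
for s, e in zip(starts, ends):
    expected[s : e + 1] += 1

print(active)
print(expected)
```

`np.add.at` is used rather than plain indexed assignment so that several intervals starting or ending at the same position are all counted.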
In code it looks like this:
    def f_proposed(df):
        z = df.copy()
        z['active'] = (z['active']
            .add(1)                  # add start markers
            .sub(df                  # subtract end markers
                .groupby('t_end')    # at each 't_end'
                .size()              # count end markers
                .reindex(df.index)   # align to the original index
                .shift()             # shift end markers by 1 row
                .fillna(0)           # fill missing values with 0s
                .astype(int))        # convert back to int
            .cumsum())               # take cumulative sum
        return z

Timing (execution time goes from about 6 s to about 2 ms), plus a test that we get the correct output:
    %%timeit
    f_original(df)
    # 6.66 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %%timeit
    f_proposed(df)
    # 2.06 ms ± 7.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # test
    z_original = f_original(df)
    z_proposed = f_proposed(df)
    z_proposed.equals(z_original)
    # True

P.S. For `f_original` I am using:
    def f_original(df):
        z = df.copy()
        for row in df.itertuples():
            z.loc[row.Index : row.t_end, "active"] += 1
        return z

Update: here is the code for the illustration:
    import matplotlib.pyplot as plt
    import numpy as np

    x = np.zeros((3, 10))
    x[(2, 1, 0), (0, 1, 2)] = 1    # start markers
    x[(2, 1, 0), (4, 7, 6)] = -1   # end markers (one slot after each end)
    v = 3

    fig, ax = plt.subplots(1, 3, figsize=(12, 6))

    # markers
    ax[0].set_title('Markers:\nblue "start" (+1)\nred "end" (-1)')
    ax[0].pcolormesh(x, cmap='RdBu', vmin=-v, vmax=v, edgecolors='w')

    # net started/ended
    ax[1].set_title('Net started/ended =\nSum of markers')
    ax[1].pcolormesh(
        np.sum(x, axis=0)[None, :],
        cmap='RdBu', vmin=-v, vmax=v, edgecolors='w')

    # activity
    ax[2].set_title('Activity =\nCumulative sum')
    im = ax[2].pcolormesh(
        np.cumsum(np.sum(x, axis=0))[None, :],
        cmap='RdBu', vmin=-v, vmax=v, edgecolors='w')

    # hide ticks and keep the cells square
    for a in ax:
        a.yaxis.set_ticks([])
        a.xaxis.set_ticks([])
        a.set_aspect('equal')
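One assumption in `f_proposed` is worth spelling out: `reindex(df.index)` keeps only those `t_end` values that coincide exactly with an index label, which holds for the minute-grid data in the question. If that might not hold, a possible variant is to locate each end marker by position with `np.searchsorted` instead. The sketch below is not part of the original answer; the name `f_searchsorted` is hypothetical, and it assumes a sorted, unique index and `t_end` no earlier than the row's own label:

```python
import numpy as np
import pandas as pd


def f_searchsorted(df):
    """Hypothetical variant of f_proposed (not from the original answer).

    Assumes df.index is sorted and unique, and t_end >= the row's own label.
    """
    n = len(df)
    labels = df.index.to_numpy()
    # For each row, the position of the first label strictly after t_end:
    # that is where its -1 "end" marker goes (position n means "never ends
    # within the frame").
    end_pos = np.searchsorted(labels, df["t_end"].to_numpy(), side="right")
    ends = np.bincount(end_pos, minlength=n + 1)[:n]
    z = df.copy()
    # Every row starts one interval at its own position (+1); the cumulative
    # sum of starts minus ends is the number of open intervals.
    z["active"] = df["active"].to_numpy() + np.cumsum(1 - ends)
    return z


# Small seeded example with the same shape as the data in the question.
rng = np.random.default_rng(0)
n_rows = 200
ix_ = pd.date_range(start="2020-01-01 00:00", freq="min", periods=n_rows)
offsets_ = pd.to_timedelta(rng.integers(0, 60, size=n_rows), unit="min")
df_small = pd.DataFrame({"t_end": (ix_ + offsets_).to_numpy()}, index=ix_)
df_small["active"] = 0

# Reference result from the original loop.
z_loop = df_small.copy()
for row in df_small.itertuples():
    z_loop.loc[row.Index : row.t_end, "active"] += 1

z_fast = f_searchsorted(df_small)
```

On this small example the result can be compared against the loop column by column; no timing claim is made for this variant.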