我是一个Python新手并拥有以下pandas数据帧 - 我正在尝试编写填充'signal'列的代码,如下所示:
Days long_entry_flag long_exit_flag signal
1 FALSE TRUE
2 FALSE FALSE
3 TRUE FALSE 1
4 TRUE FALSE 1
5 FALSE FALSE 1
6 TRUE FALSE 1
7 TRUE FALSE 1
8 FALSE TRUE
9 FALSE TRUE
10 TRUE FALSE 1
11 TRUE FALSE 1
12 TRUE FALSE 1
13 FALSE FALSE 1
14 FALSE TRUE
15 FALSE FALSE
16 FALSE TRUE
17 TRUE FALSE 1
18 TRUE FALSE 1
19 FALSE FALSE 1
20 FALSE FALSE 1
21 FALSE TRUE
22 FALSE FALSE
23 FALSE FALSE
Run Code Online (Sandbox Code Playgroud)
我的pseudo-code版本将采取以下步骤
约的方式来填充快速(如果可能?使用矢量化)的"信号"列欢迎的想法 - 这是一个的大数据帧与排数以万计的一个子集,它是按顺序进行分析许多dataframes之一.
提前谢谢了!
你可以做
# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
inside = (not row['long_exit_flag']
if inside
else row['long_entry_flag']
and not row['long_exit_flag']) # [True, True] case
df.at[ix, 'signal'] = 1 if inside else np.nan
Run Code Online (Sandbox Code Playgroud)
这将准确地为您提供您发布的输出.
受到@ jezrael的回答的启发,我创造了一个稍微高效的上述版本,同时仍然尽力保持它尽可能整洁:
# Same assumption of starting from the "outside"
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
if df.at[ix - 1, 'signal']
else df.at[ix, 'long_entry_flag']
and not df.at[ix, 'long_exit_flag']) # [True, True] case
# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])
Run Code Online (Sandbox Code Playgroud)
为了提高性能,请使用Numba解决方案:
arr = df[['long_exit_flag','long_entry_flag']].values
@jit
def f(A):
inside = False
out = np.ones(len(A), dtype=float)
for i in range(len(arr)):
inside = not A[i, 0] if inside else A[i, 1]
if not inside:
out[i] = np.nan
return out
df['signal'] = f(arr)
Run Code Online (Sandbox Code Playgroud)
表现:
#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)
In [189]: %%timeit
...: inside = False
...: for ix, row in df.iterrows():
...: inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
...: df.at[ix, 'signal'] = 1 if inside else np.nan
...:
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [190]: %%timeit
...: arr = df[['long_exit_flag','long_entry_flag']].values
...:
...: @jit
...: def f(A):
...: inside = False
...: out = np.ones(len(A), dtype=float)
...: for i in range(len(arr)):
...: inside = not A[i, 0] if inside else A[i, 1]
...: if not inside:
...: out[i] = np.nan
...: return out
...:
...: df['signal'] = f(arr)
...:
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [200]: %%timeit
...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
...:
...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
...:
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run Code Online (Sandbox Code Playgroud)
你也可以使用numpy进行移位,同时@Dark代码也可以简化:
In [222]: %%timeit
...: d = np.where(~df['long_exit_flag'].values, df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
...: shifted = np.insert(d[:-1], 0, np.nan)
...: m = (shifted==0) | (shifted==1)
...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
...:
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run Code Online (Sandbox Code Playgroud)
编辑:
您还可以检查it iterrows是否存在性能问题?用于执行pandas中各种操作的一般优先顺序.
| 归档时间: |
|
| 查看次数: |
560 次 |
| 最近记录: |