Populate a third column using column 1 and column 2

Baz*_*Baz 10 python pandas

I'm new to Python and have the following pandas DataFrame. I'm trying to write code that populates the 'signal' column as shown below:

Days    long_entry_flag long_exit_flag  signal
 1      FALSE           TRUE    
 2      FALSE           FALSE   
 3      TRUE            FALSE            1
 4      TRUE            FALSE            1
 5      FALSE           FALSE            1
 6      TRUE            FALSE            1
 7      TRUE            FALSE            1
 8      FALSE           TRUE    
 9      FALSE           TRUE    
 10     TRUE            FALSE            1
 11     TRUE            FALSE            1
 12     TRUE            FALSE            1
 13     FALSE           FALSE            1
 14     FALSE           TRUE    
 15     FALSE           FALSE   
 16     FALSE           TRUE    
 17     TRUE            FALSE            1
 18     TRUE            FALSE            1
 19     FALSE           FALSE            1
 20     FALSE           FALSE            1
 21     FALSE           TRUE    
 22     FALSE           FALSE
 23     FALSE           FALSE
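For anyone who wants to reproduce this, the table above can be rebuilt as a DataFrame along these lines. This is only a sketch: the names `entry_days`, `exit_days`, `signal_days` and `expected` are introduced here for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Days on which each flag is TRUE, transcribed from the table above.
entry_days = {3, 4, 6, 7, 10, 11, 12, 17, 18}
exit_days = {1, 8, 9, 14, 16, 21}

days = list(range(1, 24))
df = pd.DataFrame({
    "Days": days,
    "long_entry_flag": [d in entry_days for d in days],
    "long_exit_flag": [d in exit_days for d in days],
})

# Desired 'signal' column from the table: 1 on days 3-7, 10-13 and 17-20, NaN elsewhere.
signal_days = set(range(3, 8)) | set(range(10, 14)) | set(range(17, 21))
expected = pd.Series(
    [1.0 if d in signal_days else np.nan for d in days], name="signal"
)
```

`expected` can then be compared against whatever a candidate solution writes into `df['signal']`.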

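My pseudo-code version would take the following steps:

  1. Look down the ['long_entry_flag'] column until the entry condition is True (day 3 initially)
  2. Then enter '1' into the ['signal'] column every day until the exit condition ['long_exit_flag'] is True (day 8)
  3. Then look back at the ['long_entry_flag'] column and wait for the next entry (which occurs on day 10)
  4. Again enter '1' into the ['signal'] column every day until the exit condition is True (day 14)
  5. And so on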

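Ideas on how to populate the 'signal' column quickly (using vectorization, if possible?) are welcome. This is a subset of a large DataFrame with tens of thousands of rows, and it is one of many DataFrames being analysed in sequence.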

Thanks in advance!

ayo*_*rgo 7

You can do:

import numpy as np

# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
    inside = (not row['long_exit_flag']
              if inside
              else row['long_entry_flag']
                  and not row['long_exit_flag']) # [True, True] case
    df.at[ix, 'signal'] = 1 if inside else np.nan

This gives exactly the output you posted.


Inspired by @jezrael's answer, I wrote a slightly more efficient version of the above, while still trying to keep it as tidy as possible:

# Same assumption of starting from the "outside"; this also relies on the
# default 0..n-1 integer index, because the loop reads row ix - 1
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
    df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
                           if df.at[ix - 1, 'signal']
                           else df.at[ix, 'long_entry_flag']
                               and not df.at[ix, 'long_exit_flag']) # [True, True] case

# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])


jez*_*ael 5

For better performance, use a Numba solution:

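import numpy as np
from numba import jit

# Column 0 is the exit flag, column 1 is the entry flag
arr = df[['long_exit_flag','long_entry_flag']].values

@jit
def f(A):
    inside = False
    out = np.ones(len(A), dtype=float)
    for i in range(len(A)):  # loop over the argument A, not the global arr
        inside = not A[i, 0] if inside else A[i, 1]
        if not inside:
            out[i] = np.nan
    return out

df['signal'] = f(arr)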

Performance:

#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)

In [189]: %%timeit
     ...: inside = False
     ...: for ix, row in df.iterrows():
     ...:     inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
     ...:     df.at[ix, 'signal'] = 1 if inside else np.nan
     ...: 
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [190]: %%timeit
     ...: arr = df[['long_exit_flag','long_entry_flag']].values
     ...: 
     ...: @jit
     ...: def f(A):
     ...:     inside = False
     ...:     out = np.ones(len(A), dtype=float)
     ...:     for i in range(len(arr)):
     ...:         inside = not A[i, 0] if inside else A[i, 1] 
     ...:         if not inside:
     ...:             out[i] = np.nan
     ...:     return out
     ...: 
     ...: df['signal'] = f(arr)
     ...: 
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [200]: %%timeit
     ...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
     ...: 
     ...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
     ...: 
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can also do the shift with numpy, and @Dark's code can be simplified as well:

In [222]: %%timeit
     ...: d = np.where(~df['long_exit_flag'].values,  df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
     ...: shifted = np.insert(d[:-1], 0, np.nan)
     ...: m = (shifted==0) | (shifted==1)
     ...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
     ...: 
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit:

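You can also check "Does iterrows have performance issues?" for the general order of precedence among the various ways of performing operations in pandas.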
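One more vectorized variant, not taken from any of the answers above, is to mark each row with the state it forces and forward-fill. The sketch below assumes both flag columns are plain boolean Series sharing one index, that the series starts "outside", and that a row with both flags set counts as an exit (the [True, True] case from the first answer). The function name `fill_signal` is made up for this example.

```python
import numpy as np
import pandas as pd

def fill_signal(entry, exit_):
    """Return 1.0 while in a position and NaN otherwise."""
    # Mark each row with the state it forces: 0 on exit, 1 on entry, NaN if neither.
    # np.select takes the first matching condition, so exit wins when both are True.
    event = pd.Series(
        np.select([exit_.to_numpy(), entry.to_numpy()], [0.0, 1.0], default=np.nan),
        index=entry.index,
    )
    # Carry the last forced state forward; rows before any event count as "outside".
    state = event.ffill().fillna(0.0)
    # Keep 1.0 where inside, NaN everywhere else.
    return state.where(state == 1.0)

# Small hypothetical example
entry = pd.Series([False, False, True, True, False, False, True])
exit_ = pd.Series([True, False, False, False, False, True, False])
signal = fill_signal(entry, exit_)
```

On the question's data this would be called as `df['signal'] = fill_signal(df['long_entry_flag'], df['long_exit_flag'])`. It works because an exit row always leaves the position, an entry-only row always ends up inside, and any other row keeps the previous state, which is exactly what a forward fill does. No timing was measured for this variant.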