Keep the top N values per row within groups of column indices in a DataFrame

s5s*_*s5s 3 python performance numpy dataframe pandas

I can't find an elegant solution to this problem (there may not be one).

I have the following sample DataFrame:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(np.random.randn(10,10)).abs()

          0         1         2         3         4         5         6  \
0  1.764052  0.400157  0.978738  2.240893  1.867558  0.977278  0.950088   
1  0.144044  1.454274  0.761038  0.121675  0.443863  0.333674  1.494079   
2  2.552990  0.653619  0.864436  0.742165  2.269755  1.454366  0.045759   
3  0.154947  0.378163  0.887786  1.980796  0.347912  0.156349  1.230291   
4  1.048553  1.420018  1.706270  1.950775  0.509652  0.438074  1.252795   
5  0.895467  0.386902  0.510805  1.180632  0.028182  0.428332  0.066517   
6  0.672460  0.359553  0.813146  1.726283  0.177426  0.401781  1.630198   
7  0.729091  0.128983  1.139401  1.234826  0.402342  0.684810  0.870797   
8  1.165150  0.900826  0.465662  1.536244  1.488252  1.895889  1.178780   
9  0.403177  1.222445  0.208275  0.976639  0.356366  0.706573  0.010500   

          7         8         9  
0  0.151357  0.103219  0.410599  
1  0.205158  0.313068  0.854096  
2  0.187184  1.532779  1.469359  
3  1.202380  0.387327  0.302303  
4  0.777490  1.613898  0.212740  
5  0.302472  0.634322  0.362741  
6  0.462782  0.907298  0.051945  
7  0.578850  0.311553  0.056165  
8  0.179925  1.070753  1.054452  
9  1.785870  0.126912  0.401989  

I have the following zone map:

zones = {"A":[0,1,2],"B":[3,4],"C":[5,6,7,8],"D":[9]}

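The zones define which groups of columns should be examined together. For each row of the df[columns] DataFrame, I want to keep the top N items (NB: the top N within each row, i.e. cross-sectionally, as shown below) and set the rest to zero. For example, for zone "A" with N = 2, I would examine the following DataFrame: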

          0         1         2
0  1.764052  0.400157  0.978738
1  0.144044  1.454274  0.761038
2  2.552990  0.653619  0.864436
3  0.154947  0.378163  0.887786
4  1.048553  1.420018  1.706270
5  0.895467  0.386902  0.510805
6  0.672460  0.359553  0.813146
7  0.729091  0.128983  1.139401
8  1.165150  0.900826  0.465662
9  0.403177  1.222445  0.208275  

Since N = 2, I keep the top N items in each row:

          0         1         2
0  1.764052  0.        0.978738
1  0.        1.454274  0.761038
2  2.552990  0.        0.864436
3  0.        0.378163  0.887786
4  0.        1.420018  1.706270
5  0.895467  0.        0.510805
6  0.672460  0.        0.813146
7  0.729091  0.        1.139401
8  1.165150  0.900826  0.
9  0.403177  1.222445  0.
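To make the per-row rule concrete, here is a minimal NumPy sketch applied to the first three rows of zone "A". The helper name `keep_top_n_per_row` is made up for illustration and is not part of the question; it assumes a purely numeric 2-D block and returns a copy instead of modifying its input.

```python
import numpy as np

def keep_top_n_per_row(block, n):
    """Return a copy of `block` with all but the n largest values per row set to 0."""
    block = np.asarray(block)
    out = block.copy()
    n_drop = block.shape[1] - n          # how many values to zero out in each row
    if n_drop <= 0:
        return out                       # the zone has at most n columns: keep everything
    # Ascending argsort: the first n_drop positions in each row are the smallest values.
    drop_cols = np.argsort(block, axis=1)[:, :n_drop]
    rows = np.arange(block.shape[0])[:, None]   # column vector, broadcasts against drop_cols
    out[rows, drop_cols] = 0
    return out

# First three rows of zone "A" (columns 0, 1, 2) from the sample DataFrame.
zone_a = np.array([
    [1.764052, 0.400157, 0.978738],
    [0.144044, 1.454274, 0.761038],
    [2.552990, 0.653619, 0.864436],
])
result = keep_top_n_per_row(zone_a, 2)
print(result)
```

With ties inside a row, which of the tied values is dropped depends on the sort order, so the sketch is only deterministic for rows with distinct values.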

The complete output for the zone map above with N = 2 would look like this:

          0         1         2         3         4         5         6  \
0  1.764052  0.        0.978738  2.240893  1.867558  0.977278  0.950088   
1  0.        1.454274  0.761038  0.121675  0.443863  0.333674  1.494079   
2  2.552990  0.        0.864436  0.742165  2.269755  1.454366  0.         
3  0.        0.378163  0.887786  1.980796  0.347912  0.        1.230291   
4  0.        1.420018  1.706270  1.950775  0.509652  0.        1.252795   
5  0.895467  0.        0.510805  1.180632  0.028182  0.428332  0.         
6  0.672460  0.        0.813146  1.726283  0.177426  0.        1.630198   
7  0.729091  0.        1.139401  1.234826  0.402342  0.684810  0.870797   
8  1.165150  0.900826  0.        1.536244  1.488252  1.895889  1.178780   
9  0.403177  1.222445  0.        0.976639  0.356366  0.706573  0.         

          7         8         9  
0  0.        0.        0.410599  
1  0.        0.        0.854096  
2  0.        1.532779  1.469359  
3  1.202380  0.        0.302303  
4  0.        1.613898  0.212740  
5  0.        0.634322  0.362741  
6  0.        0.907298  0.051945  
7  0.        0.        0.056165  
8  0.        0.        1.054452  
9  1.785870  0.        0.401989  

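The way I have tried to solve this feels rather slow. I loop over the zones and get a zone_df for each one; then I loop over its rows, sort each row and call row.head(len(row) - N) to get the index and columns that need to be set to 0. I then use these values (stored in a dict) to set the cells in zone_df to zero, and finally combine the zone_dfs.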

Div*_*kar 5

Here's one approach:

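def keeptopN_perkey(df, zones, N=2):
    # Work on the underlying array. This assumes df.values is a writable
    # view of the frame's data (single dtype; pandas copy-on-write not enabled).
    a = df.values
    indx = zones.values()
    # Column vector of row numbers; broadcasts against the per-row column indices
    r = np.arange(a.shape[0])[:,None]
    for i in indx:
        b = a[:,i]  # fancy indexing returns a copy of the zone's columns
        L = np.maximum(len(i)-N,0)  # number of values to zero out per row
        if L>0:
            # Positions of the L smallest values in each row.
            # kth=L-1 keeps the call valid even when N=0 (L == len(i)).
            idx = np.argpartition(b, L-1, axis=1)[:,:L]
            # or np.argsort(b,axis=1)[:,:L]
            b[r, idx] = 0
        a[:,i] = b  # write the modified zone back into the array
    return df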
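The choice between `np.argpartition` and `np.argsort` in the function above can be checked in isolation. The following is a small sketch on made-up, tie-free data: it confirms that the first L positions returned by `np.argpartition` hold the same set of indices as the first L positions of a full `np.argsort`, just not necessarily in sorted order. That is all the function needs, because every selected position is set to zero anyway.

```python
import numpy as np

# Illustrative data only; each row has distinct values so the result is unambiguous.
b = np.array([[55, 58, 75, 20],
              [81, 23, 69, 98]])
L = 2  # number of smallest values to locate per row

# Partial ordering: the first L slots hold the L smallest values, in any order.
part = np.argpartition(b, L, axis=1)[:, :L]
# Full ordering: the first L slots hold the L smallest values, in ascending order.
full = np.argsort(b, axis=1)[:, :L]

same = all(set(p) == set(f) for p, f in zip(part.tolist(), full.tolist()))
print(same)
```

Since `np.argpartition` avoids fully sorting each row, it is usually the cheaper option when only membership in the smallest L matters.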

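The benefit is that we write back into the input DataFrame by working on its underlying array data, without creating a separate output DataFrame.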

Sample run:

In [303]: np.random.seed(0)
     ...: N = 2
     ...: df = pd.DataFrame(np.random.randint(11,99,(4,10)))
     ...: zones = {"A": [0,1,2], "B": [3,4], "C": [5, 6,7,8], "D": [9]}
     ...: 

In [304]: df
Out[304]: 
    0   1   2   3   4   5   6   7   8   9
0  55  58  75  78  78  20  94  32  47  98
1  81  23  69  76  50  98  57  92  48  36
2  88  83  20  31  91  80  90  58  75  93
3  60  40  30  30  25  50  43  76  20  68

In [305]: keeptopN_perkey(df, zones, N=2)
Out[305]: 
    0   1   2   3   4   5   6   7   8   9
0   0  58  75  78  78   0  94   0  47  98
1  81   0  69  76  50  98   0  92   0  36
2  88  83   0  31  91  80  90   0   0  93
3  60  40   0  30  25  50   0  76   0  68
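Because `keeptopN_perkey` writes into its input, the original values are lost after the call. If the input has to be preserved, one possible variant works on an explicit copy of the data and builds a new DataFrame from it. This is only a sketch: the name `keep_top_n_copy` is hypothetical, and it assumes a single numeric dtype and zones that list column positions, as in the examples above.

```python
import numpy as np
import pandas as pd

def keep_top_n_copy(df, zones, n=2):
    """Hypothetical non-mutating variant: returns a new frame, leaves `df` untouched."""
    n = max(n, 0)
    a = df.to_numpy(copy=True)               # explicit copy, always writable
    rows = np.arange(a.shape[0])[:, None]
    for cols in zones.values():
        cols = list(cols)
        n_drop = len(cols) - n               # values to zero out per row in this zone
        if n_drop <= 0:
            continue
        block = a[:, cols]                   # copy of the zone's columns
        # The first n_drop slots hold the n_drop smallest values of each row.
        drop = np.argpartition(block, n_drop - 1, axis=1)[:, :n_drop]
        block[rows, drop] = 0
        a[:, cols] = block                   # write the zone back into the copy
    return pd.DataFrame(a, index=df.index, columns=df.columns)

# Same values as the sample run above, written out literally.
df = pd.DataFrame([
    [55, 58, 75, 78, 78, 20, 94, 32, 47, 98],
    [81, 23, 69, 76, 50, 98, 57, 92, 48, 36],
    [88, 83, 20, 31, 91, 80, 90, 58, 75, 93],
    [60, 40, 30, 30, 25, 50, 43, 76, 20, 68],
])
zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
out = keep_top_n_copy(df, zones, n=2)
print(out)
```

The copy costs extra memory and time, so this trades away the in-place advantage in exchange for leaving the caller's frame unchanged.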

Benchmarking

Approaches from the other posts:

def mask_n(df, n): # @piRSquared's helper func
    v = np.zeros(df.shape, dtype=bool)
    n = min(n, v.shape[1])
    if v.shape[1] > n:
        j = np.argpartition(-df.values, n, 1)[:, :n].ravel()
        i = np.arange(v.shape[0]).repeat(n)
        v[i, j] = True
        return df.where(v, 0)
    else:
        return df

def piRSquared1(df, zones): # @piRSquared's soln1
    zinv = {v: k for k in zones for v in zones[k]}
    return df.groupby(zinv, 1).apply(mask_n, n=2)

def piRSquared2(df, zones): # @piRSquared's soln2
    zinv = {v: k for k in zones for v in zones[k]}
    return df.mask(df.groupby(zinv, 1).rank(axis=1, method='first', 
                   ascending=False) > 2, 0)

def COLDSPEED1(df, zones): # @COLDSPEED's soln
    for z in zones:                   
        df2 = df.iloc[:, zones[z]]
        df.iloc[:, zones[z]] = \
                np.where(((-df2).rank(axis=1) - 1) >= 2, 0, df2.values)
    return df

def s5s1(df, zones, N=2): # @s5s's soln
    final = []
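    for zone_id, cols in zones.items():  # was .iteritems(), which is Python 2 only
        values = {}
        d = df[cols]  # columns of the current zone
        for i, row in d.iterrows():
            if len(row) > N:
                row = row.sort_values()  # was row.sort(), since removed from pandas
                row[row.head(len(row) - N).index] = 0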
            values[i] = row
        d = pd.DataFrame(values).T
        final.append(d)

    return pd.concat(final, axis=1)[df.columns]

Timings on a larger dataset:

In [458]: # Setup
     ...: ncols = 1000
     ...: cuts = np.sort(np.random.choice(ncols, ncols//3, replace=0))
     ...: indx_split = np.split(np.arange(ncols),cuts)
     ...: zones = {i:p_i for i,p_i in enumerate(list(map(list,indx_split)))}
     ...: df = pd.DataFrame(np.random.randint(11,99,(10,ncols)))
     ...: N = 2
     ...: 
     ...: df1 = df.copy()
     ...: df2 = df.copy()
     ...: df3 = df.copy()
     ...: df4 = df.copy()
     ...: df5 = df.copy()
     ...: 

In [459]: %timeit COLDSPEED1(df1, zones)
     ...: %timeit piRSquared1(df2, zones)
     ...: %timeit piRSquared2(df3, zones)
     ...: %timeit s5s1(df4, zones)
     ...: %timeit keeptopN_perkey(df5, zones)
     ...: 
1 loop, best of 3: 324 ms per loop
10 loops, best of 3: 116 ms per loop
10 loops, best of 3: 81.6 ms per loop
1 loop, best of 3: 1.47 s per loop
100 loops, best of 3: 2.99 ms per loop