s5s | 3 | python, performance, numpy, dataframe, pandas
I can't find an elegant solution to this problem (there may not be one).
I have the following example DataFrame:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,10)).abs()
0 1 2 3 4 5 6 \
0 1.764052 0.400157 0.978738 2.240893 1.867558 0.977278 0.950088
1 0.144044 1.454274 0.761038 0.121675 0.443863 0.333674 1.494079
2 2.552990 0.653619 0.864436 0.742165 2.269755 1.454366 0.045759
3 0.154947 0.378163 0.887786 1.980796 0.347912 0.156349 1.230291
4 1.048553 1.420018 1.706270 1.950775 0.509652 0.438074 1.252795
5 0.895467 0.386902 0.510805 1.180632 0.028182 0.428332 0.066517
6 0.672460 0.359553 0.813146 1.726283 0.177426 0.401781 1.630198
7 0.729091 0.128983 1.139401 1.234826 0.402342 0.684810 0.870797
8 1.165150 0.900826 0.465662 1.536244 1.488252 1.895889 1.178780
9 0.403177 1.222445 0.208275 0.976639 0.356366 0.706573 0.010500
7 8 9
0 0.151357 0.103219 0.410599
1 0.205158 0.313068 0.854096
2 0.187184 1.532779 1.469359
3 1.202380 0.387327 0.302303
4 0.777490 1.613898 0.212740
5 0.302472 0.634322 0.362741
6 0.462782 0.907298 0.051945
7 0.578850 0.311553 0.056165
8 0.179925 1.070753 1.054452
9 1.785870 0.126912 0.401989
I have the following zone map:
zones = {"A":[0,1,2],"B":[3,4],"C":[5,6,7,8],"D":[9]}
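The zones show which groups of columns I should examine together. For each row of the df[columns] DataFrame, I want to keep the top N items (NB: the top N items within each row, i.e. cross-sectionally; see below) and set the rest to zero. For example, for zone "A" with N = 2, I would examine the following DataFrame: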
0 1 2
0 1.764052 0.400157 0.978738
1 0.144044 1.454274 0.761038
2 2.552990 0.653619 0.864436
3 0.154947 0.378163 0.887786
4 1.048553 1.420018 1.706270
5 0.895467 0.386902 0.510805
6 0.672460 0.359553 0.813146
7 0.729091 0.128983 1.139401
8 1.165150 0.900826 0.465662
9 0.403177 1.222445 0.208275
Since N = 2, I keep the top N items in each row:
0 1 2
0 1.764052 0. 0.978738
1 0. 1.454274 0.761038
2 2.552990 0. 0.864436
3 0. 0.378163 0.887786
4 0. 1.420018 1.706270
5 0.895467 0. 0.510805
6 0.672460 0. 0.813146
7 0.729091 0. 1.139401
8 1.165150 0.900826 0.
9 0.403177 1.222445 0.
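The per-row rule can be illustrated on a single row with pandas alone. This is only a sketch of the rule, not a fast solution; `row` is row 0 of zone "A" copied by hand, and the names `keep` and `result` are made up for the example.

```python
import pandas as pd

N = 2
# Row 0 of zone "A" (columns 0, 1, 2) from the example DataFrame.
row = pd.Series([1.764052, 0.400157, 0.978738], index=[0, 1, 2])

# Labels of the N largest entries in this row.
keep = row.nlargest(N).index

# Keep those entries and replace everything else with zero.
result = row.where(row.index.isin(keep), 0)
print(result.tolist())
```

Applying this to every row of every zone is essentially the slow looped approach; the point of the question is to do the same thing without the Python-level loops over rows.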
The entire output for the zone map above with N = 2 would look like this:
0 1 2 3 4 5 6 \
0 1.764052 0. 0.978738 2.240893 1.867558 0.977278 0.950088
1 0. 1.454274 0.761038 0.121675 0.443863 0.333674 1.494079
2 2.552990 0. 0.864436 0.742165 2.269755 1.454366 0.
3 0. 0.378163 0.887786 1.980796 0.347912 0. 1.230291
4 0. 1.420018 1.706270 1.950775 0.509652 0. 1.252795
5 0.895467 0. 0.510805 1.180632 0.028182 0.428332 0.
6 0.672460 0. 0.813146 1.726283 0.177426 0. 1.630198
7 0.729091 0. 1.139401 1.234826 0.402342 0.684810 0.870797
8 1.165150 0.900826 0. 1.536244 1.488252 1.895889 1.178780
9 0.403177 1.222445 0. 0.976639 0.356366 0.706573 0.
7 8 9
0 0. 0. 0.410599
1 0. 0. 0.854096
2 0. 1.532779 1.469359
3 1.202380 0. 0.302303
4 0. 1.613898 0.212740
5 0. 0.634322 0.362741
6 0. 0.907298 0.051945
7 0. 0. 0.056165
8 0. 0. 1.054452
9 1.785870 0. 0.401989
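The way I tried to solve this feels a bit slow. I loop over the zones and build a zone_df for each one, then loop over its rows, sort each row and call row.head(len(row) - N) to get the index and columns that need to be set to 0. I then use those values (stored in a dict) to set the cells in zone_df to zero, and finally combine the zone_dfs.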
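Here's one way -
def keeptopN_perkey(df, zones, N=2):
    a = df.values                       # underlying array of df
    indx = zones.values()
    r = np.arange(a.shape[0])[:,None]   # row numbers as a column vector
    for i in indx:
        b = a[:,i]                      # this zone's columns (a copy)
        L = np.maximum(len(i)-N,0)      # entries to zero out per row
        if L>0:
            # positions of the L smallest values in each row
            idx = np.argpartition(b, L, axis=1)[:,:L]
            # or np.argsort(b,axis=1)[:,:L]
            b[r, idx] = 0
        a[:,i] = b                      # write the zone back
    return df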
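The advantage is that we write back into the input DataFrame through its underlying array data, without creating an output DataFrame. Note that this relies on df.values being a writable view of the frame's data, which holds only when all columns share one dtype and pandas Copy-on-Write is not enabled.
Sample run -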
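To see what the np.argpartition step does in isolation, here is a minimal sketch on the zone "C" columns of the first two rows from the sample run below. The variable names mirror those in the function; the array is typed in by hand for illustration.

```python
import numpy as np

# Zone "C" (columns 5..8) for rows 0 and 1 of the sample run.
b = np.array([[20, 94, 32, 47],
              [98, 57, 92, 48]])
N = 2
L = b.shape[1] - N  # number of entries to zero out per row

# Partitioning around position L puts the indices of the L smallest
# values of each row into the first L slots (in no particular order).
idx = np.argpartition(b, L, axis=1)[:, :L]

# A column vector of row numbers broadcasts against idx, so each row
# is indexed with its own set of column positions.
r = np.arange(b.shape[0])[:, None]
b[r, idx] = 0
print(b)
```

Because only a partition is needed rather than a full sort, this does less work per row than np.argsort, which is the alternative mentioned in the code comment.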
In [303]: np.random.seed(0)
...: N = 2
...: df = pd.DataFrame(np.random.randint(11,99,(4,10)))
...: zones = {"A": [0,1,2], "B": [3,4], "C": [5, 6,7,8], "D": [9]}
...:
In [304]: df
Out[304]:
0 1 2 3 4 5 6 7 8 9
0 55 58 75 78 78 20 94 32 47 98
1 81 23 69 76 50 98 57 92 48 36
2 88 83 20 31 91 80 90 58 75 93
3 60 40 30 30 25 50 43 76 20 68
In [305]: keeptopN_perkey(df, zones, N=2)
Out[305]:
0 1 2 3 4 5 6 7 8 9
0 0 58 75 78 78 0 94 0 47 98
1 81 0 69 76 50 98 0 92 0 36
2 88 83 0 31 91 80 90 0 0 93
3 60 40 0 30 25 50 0 76 0 68
Approaches from the other posts -
def mask_n(df, n): # @piRSquared's helper func
    v = np.zeros(df.shape, dtype=bool)
    n = min(n, v.shape[1])
    if v.shape[1] > n:
        j = np.argpartition(-df.values, n, 1)[:, :n].ravel()
        i = np.arange(v.shape[0]).repeat(n)
        v[i, j] = True
        return df.where(v, 0)
    else:
        return df

def piRSquared1(df, zones): # @piRSquared's soln1
    zinv = {v: k for k in zones for v in zones[k]}
    return df.groupby(zinv, axis=1).apply(mask_n, n=2)

def piRSquared2(df, zones): # @piRSquared's soln2
    zinv = {v: k for k in zones for v in zones[k]}
    return df.mask(df.groupby(zinv, axis=1).rank(axis=1, method='first',
                                                 ascending=False) > 2, 0)

def COLDSPEED1(df, zones): # @COLDSPEED's soln
    for z in zones:
        df2 = df.iloc[:, zones[z]]
        df.iloc[:, zones[z]] = \
            np.where(((-df2).rank(axis=1) - 1) >= 2, 0, df2.values)
    return df

def s5s1(df, zones, N=2): # @s5s's soln
    final = []
    for zone_id, cols in zones.items():   # was zones.iteritems() (Python 2)
        values = {}
        d = df[cols] # zone A
        for i, row in d.iterrows():
            if len(row) > N:
                row = row.sort_values()   # was row.sort(), removed from pandas
                row[row.head(len(row) - N).index] = 0
            values[i] = row
        d = pd.DataFrame(values).T
        final.append(d)
    return pd.concat(final, axis=1)[df.columns]
Timings on a larger dataset -
In [458]: # Setup
...: ncols = 1000
...: cuts = np.sort(np.random.choice(ncols, ncols//3, replace=0))
...: indx_split = np.split(np.arange(ncols),cuts)
...: zones = {i:p_i for i,p_i in enumerate(list(map(list,indx_split)))}
...: df = pd.DataFrame(np.random.randint(11,99,(10,ncols)))
...: N = 2
...:
...: df1 = df.copy()
...: df2 = df.copy()
...: df3 = df.copy()
...: df4 = df.copy()
...: df5 = df.copy()
...:
In [459]: %timeit COLDSPEED1(df1, zones)
...: %timeit piRSquared1(df2, zones)
...: %timeit piRSquared2(df3, zones)
...: %timeit s5s1(df4, zones)
...: %timeit keeptopN_perkey(df5, zones)
...:
1 loop, best of 3: 324 ms per loop
10 loops, best of 3: 116 ms per loop
10 loops, best of 3: 81.6 ms per loop
1 loop, best of 3: 1.47 s per loop
100 loops, best of 3: 2.99 ms per loop
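The timings compare speed only, so it may be worth confirming that a fast version agrees with a plain one before relying on it. The sketch below is an assumption-laden check, not part of the original answers: `keep_top_n_copy` is a hypothetical variant of the argpartition approach that works on a copy (so it does not depend on df.values being a writable view), and `keep_top_n_reference` is a deliberately slow row-by-row reference. The test data uses distinct values so that ties cannot make the two disagree.

```python
import numpy as np
import pandas as pd

def keep_top_n_copy(df, zones, N=2):
    # Hypothetical variant: same argpartition idea, applied to a copy.
    a = df.to_numpy(copy=True)
    r = np.arange(a.shape[0])[:, None]
    for cols in zones.values():
        L = max(len(cols) - N, 0)      # entries to zero out per row
        if L > 0:
            b = a[:, cols]             # fancy indexing returns a copy
            idx = np.argpartition(b, L, axis=1)[:, :L]
            b[r, idx] = 0
            a[:, cols] = b
    return pd.DataFrame(a, index=df.index, columns=df.columns)

def keep_top_n_reference(df, zones, N=2):
    # Slow reference: handle one row of one zone at a time.
    out = df.copy()
    for cols in zones.values():
        L = max(len(cols) - N, 0)
        if L == 0:
            continue
        for i in range(len(df)):
            row = df.iloc[i, cols]
            drop = row.sort_values().index[:L]   # labels of the L smallest
            out.loc[df.index[i], drop] = 0
    return out

# Distinct positive integers, so there are no ties and no genuine zeros.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.permutation(40).reshape(4, 10) + 1)
zones = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}

fast = keep_top_n_copy(df, zones, N=2)
slow = keep_top_n_reference(df, zones, N=2)
print(fast.equals(slow))
```

With tied values the approaches in the benchmark can legitimately keep different entries, so an equality check like this is only meaningful on data without ties.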