M S*_*ers 3 python dataframe pandas
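I have a DataFrame in which pic_code values may be repeated. When a pic_code is repeated, I want to set the variable "keep" to "t" for the row whose weight is closest to its mpe_wgt.
For example, the second pic_code row has "keep" set to t because its "weight" is the closest to its corresponding "mpe_wgt". My code leaves "keep" as 'f' for every row and "diff" as 100 for every row.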
df['keep']='f'
df['diff']=100
def cln_df(data):
    if pd.unique(data['mpe_wgt']).shape==(1,):
        data['keep'][0:1]='t'
    elif pd.unique(data['mpe_wgt']).shape!=(1,):
        data['diff']=abs(data['weight']-(data['mpe_wgt']/100))
        data['keep'][data['diff']==min(data['diff'])]='t'
    return data
df=df.groupby('pic_code').apply(cln_df)
df before:
pic_code  weight  mpe_wgt  keep  diff
    1234      45       34     f   100
    1234      32       23     f   100
   45344      54       35     f   100
     234      76       98     f   100
     234      65       12     f   100
The output df should be:
pic_code  weight  mpe_wgt  keep  diff
    1234      45       34     f    11
    1234      32       23     t     9
   45344      54       35     t   100
     234      76       98     t    22
     234      65       12     f    53
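I am new to Python, so please keep the solution as simple as possible. I would really like to get my own approach working, so please nothing too fancy. Thanks in advance for your help.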
Here is one way. Note that I use the Boolean values True / False instead of the strings "t" and "f". This is simply good practice.
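As a small illustration of why Booleans are convenient here, a Boolean column can be used directly as a row mask and inverted with ~, with no string comparison needed. The frame `demo` and the names `kept` and `dropped` are hypothetical and exist only for this sketch:

```python
import pandas as pd

# Hypothetical toy frame, only to illustrate the point about Booleans
demo = pd.DataFrame({
    "pic_code": [1234, 1234, 45344],
    "keep": [False, True, True],
})

# A Boolean column can be used directly as a row mask
kept = demo[demo["keep"]]

# ...and inverted with ~ to select the rows that were not kept
dropped = demo[~demo["keep"]]

print(kept)
print(dropped)
```

With string flags the same selection would need a comparison such as `demo["keep"] == "t"` every time.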
Also note that all of the operations below are vectorised, whereas groupby.apply with a custom function certainly is not.
Setup
print(df)
   pic_code  weight  mpe_wgt
0      1234      45       34
1      1234      32       23
2     45344      54       35
3       234      76       98
4       234      65       12
Solution
# calculate difference
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
# sort by pic_code, then by diff
df = df.sort_values(['pic_code', 'diff'])
# define keep column as True only for non-duplicates by pic_code
df['keep'] = ~df.duplicated('pic_code')
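To see why the last step marks exactly one row per pic_code, it may help to look at duplicated() on its own. By default it returns False for the first occurrence of a value and True for every later repeat, so negating it leaves True only on the first row of each group, which after the sort is the row with the smallest diff. The Series `codes` below is a toy stand-in for the sorted pic_code column:

```python
import pandas as pd

# Toy pic_code values, already sorted so the best match comes first in each group
codes = pd.Series([234, 234, 1234, 1234, 45344])

# duplicated() is False for the first occurrence and True for later repeats
is_repeat = codes.duplicated()

# Negating it leaves True on the first row of each group only
keep = ~is_repeat

print(keep.tolist())  # [True, False, True, False, True]
```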
Result
print(df)
   pic_code  weight  mpe_wgt  diff   keep
3       234      76       98    22   True
4       234      65       12    53  False
1      1234      32       23     9   True
0      1234      45       34    11  False
2     45344      54       35    19   True
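If the original row order matters, a variation on the same idea avoids the sort. This is a sketch rather than the method shown above: it rebuilds the sample frame and uses groupby(...).idxmin() to find, for each pic_code, the index label of the row with the smallest diff. On ties idxmin returns the first occurrence, so one row per group is kept. The name `closest` is introduced here for illustration only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

# Absolute difference between weight and mpe_wgt, row by row
df["diff"] = (df["weight"] - df["mpe_wgt"]).abs()

# Index label of the smallest diff within each pic_code group
closest = df.groupby("pic_code")["diff"].idxmin()

# True only for the rows whose label was picked above
df["keep"] = df.index.isin(closest)

print(df)
```

The rows stay in their original order, and a pic_code that appears only once is kept automatically because its single row is the group minimum.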