I currently have a dataframe that looks like this:
Idnumber Ownership Date
1 100 2006
2 >50 2006
1 80 2007
3 NaN 2006
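The Ownership column is currently of float type. What I want is a groupby on Idnumber that returns the maximum value for each Idnumber. The problem is that this is not possible with values containing symbols such as >, < or ± (error: unorderable types: float() >= str()).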
df['Ownership'] = df['Ownership'].astype(str)
df['Ownership'] = df['Ownership'].map(lambda x: x.strip('± = > + <'))
df['Ownership'] = df['Ownership'].astype(float).fillna(0.0)
df['Ownershipadjusted']= df['Ownership'].groupby([df['Idnumber'],df['Ownership']]).max()
This does not actually work, because converting back to float raises an error: could not convert string to float.
df['Ownership'] = df['Ownership'].apply(pd.to_numeric, errors='coerce')
This did not give the desired result either. Is there a more direct way to remove the symbols from the values, or to make this conversion work?
To avoid confusion, this is what I need:
Idnumber Ownership Date Ownership adjusted
1 100 2006 100
2 50 2006 50
1 80 2007 100
3 0 2006 0
Of course, the dataframe contains far more than 4 observations.
Cast the dtype to str, extract the digits with str.extract, and cast the dtype to float:
In [215]:
df['Ownership'] = df['Ownership'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)
df
Out[215]:
Idnumber Ownership Date
0 1 100 2006
1 2 50 2006
2 1 80 2007
3 3 NaN 2006
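As a rough illustration of why `errors='coerce'` alone falls short, the sketch below compares it with the extraction approach. The sample values are hypothetical (the `±12.5` entry in particular is not from the question), and the pattern is extended with an optional decimal part, which is an assumption about what the real data may contain. The sign of a negative number would be lost with this pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: plain numbers mixed with annotated strings and a missing value.
raw = pd.Series([100, ">50", 80, np.nan, "±12.5"], name="Ownership")

# errors="coerce" turns every value that is not a clean number into NaN,
# so ">50" and "±12.5" are lost rather than cleaned.
coerced = pd.to_numeric(raw, errors="coerce")

# Extracting the numeric part first keeps those values. The optional
# (?:\.\d+)? group captures decimals; expand=False returns a Series.
extracted = (
    raw.astype(str)
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

print(pd.DataFrame({"raw": raw, "coerced": coerced, "extracted": extracted}))
```

Rows without any digits (such as the string `'nan'` produced by `astype(str)`) yield no match and stay NaN after the cast.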
Your groupby statement is also wrong; you need this:
In [218]:
df['Ownershipadjusted']= df.groupby(['Idnumber'])['Ownership'].transform('max')
df
Out[218]:
Idnumber Ownership Date Ownershipadjusted
0 1 100 2006 100
1 2 50 2006 50
2 1 80 2007 100
3 3 NaN 2006 NaN
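The output above leaves NaN for Idnumber 3, whereas the question asked for 0. A minimal end-to-end sketch, assuming the four sample rows from the question and that missing values should be treated as 0, adds `fillna(0.0)` before grouping:

```python
import numpy as np
import pandas as pd

# Sample rows reconstructed from the question.
df = pd.DataFrame({
    "Idnumber": [1, 2, 1, 3],
    "Ownership": [100, ">50", 80, np.nan],
    "Date": [2006, 2006, 2007, 2006],
})

# Strip the symbols by extracting the digits, cast to float,
# and replace missing values with 0 as the question requested.
df["Ownership"] = (
    df["Ownership"].astype(str)
    .str.extract(r"(\d+)", expand=False)
    .astype(float)
    .fillna(0.0)
)

# transform('max') broadcasts each group's maximum back onto its rows.
df["Ownershipadjusted"] = df.groupby("Idnumber")["Ownership"].transform("max")
print(df)
```

Using `transform` rather than a plain `max()` keeps the result aligned with the original index, so it can be assigned directly as a new column.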