How do I impute NaN values based on the values of another column?

Question

sto*_*ock 5 python pandas

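I have two columns in a DataFrame:

1) Work experience (years)

2) company_type

I want to impute the company_type column based on the work experience column. The company_type column has NaN values that should be filled in according to work experience. The work experience column has no missing values.

Here work_exp is numeric data and company_type is categorical data.

Sample data: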

Work_exp      company_type
   10            PvtLtd
   0.5           startup
   6           Public Sector
   8               NaN
   1             startup
   9              PvtLtd
   4               NaN
   3           Public Sector
   2             startup
   0               NaN

I have already decided on the thresholds for imputing the NaN values.

Startup if work_exp < 2yrs
Public sector if work_exp > 2yrs and <8yrs
PvtLtd if work_exp >8yrs

Given the threshold criteria above, how can I impute the missing categorical values in the company_type column?
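
The rules as stated can be sketched as a small function. This is only an illustration: the helper name `company_type_from_experience` is hypothetical, and the label spellings are assumed to follow the sample data rather than the threshold list. Note that the stated rules leave exactly 2 and exactly 8 years unassigned, so the sketch returns `None` for those values.

```python
from typing import Optional


def company_type_from_experience(work_exp: float) -> Optional[str]:
    """Map years of experience to a company type using the stated thresholds.

    Hypothetical helper for illustration; the name is not from the question.
    """
    if work_exp < 2:
        return "startup"
    if 2 < work_exp < 8:
        return "Public Sector"
    if work_exp > 8:
        return "PvtLtd"
    # Exactly 2 or exactly 8 years: the stated rules do not cover these values.
    return None


for years in (0, 4, 8, 10):
    print(years, company_type_from_experience(years))
```

Any vectorized solution has to decide which side the boundary values 2 and 8 fall on.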

Answer 1

jpp*_*jpp 4

You can use numpy.select together with numpy.where:

import numpy as np

# define conditions and values
conditions = [df['Work_exp'] < 2, df['Work_exp'].between(2, 8), df['Work_exp'] > 8]
values = ['Startup', 'PublicSector', 'PvtLtd']

# apply logic where company_type is null
df['company_type'] = np.where(df['company_type'].isnull(),
                              np.select(conditions, values),
                              df['company_type'])

print(df)

   Work_exp  company_type
0      10.0        PvtLtd
1       0.5       startup
2       6.0  PublicSector
3       8.0  PublicSector
4       1.0       startup
5       9.0        PvtLtd
6       4.0  PublicSector
7       3.0  PublicSector
8       2.0       startup
9       0.0       Startup
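
For a self-contained run, the same idea can be sketched with the sample data from the question built inline. Two choices here are assumptions, not part of the answer: the imputed labels reuse the spellings already present in the column (so that "startup" and "Startup" do not end up as separate categories), and `np.select` is given a string `default`, since newer NumPy versions may reject mixing string choices with the integer default `0`. `Series.fillna` is used in place of `np.where`; it fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the missing company types.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

conditions = [
    df["Work_exp"] < 2,
    df["Work_exp"].between(2, 8),  # inclusive on both ends
    df["Work_exp"] > 8,
]
# Assumption: reuse the spellings already in the column.
values = ["startup", "Public Sector", "PvtLtd"]

# The string default is never selected here because the conditions cover
# every non-missing Work_exp value; it only keeps the dtypes compatible.
imputed = pd.Series(np.select(conditions, values, default="unknown"),
                    index=df.index)

# Keep existing labels; fill only the missing ones.
df["company_type"] = df["company_type"].fillna(imputed)
print(df)
```

With this data, rows 3 and 6 (8 and 4 years) are filled with "Public Sector" and row 9 (0 years) with "startup"; all other rows keep their original labels.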

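pd.Series.between includes both the start and end values by default and supports comparisons between float values. To exclude the boundaries, pass inclusive='neither' (pandas versions before 1.3 used inclusive=False).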

import pandas as pd

s = pd.Series([2, 2.5, 4, 4.5, 5])

s.between(2, 4.5)

0     True
1     True
2     True
3     True
4    False
dtype: bool
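
The effect of the `inclusive` argument can be checked on the same series. This sketch assumes pandas 1.3 or later, where `inclusive` takes one of the strings "both", "neither", "left" or "right".

```python
import pandas as pd

s = pd.Series([2, 2.5, 4, 4.5, 5])

# Default: both endpoints are included.
both = s.between(2, 4.5)

# Exclude both endpoints.
neither = s.between(2, 4.5, inclusive="neither")

print(both.tolist())     # [True, True, True, True, False]
print(neither.tolist())  # [False, True, True, False, False]
```

Using "left" or "right" instead includes only one endpoint, which is one way to give boundary values such as 2 and 8 to exactly one category.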
