Cec*_*lia 1 python pandas scikit-learn
I have a df in which one column looks like this:
channels
0 [email, mobile, social]
1 [web, email, mobile, social]
2 [web, email, mobile]
3 [web, email, mobile]
4 [web, email]
5 [web, email, mobile, social]
6 [web, email, mobile, social]
7 [email, mobile, social]
8 [web, email, mobile, social]
9 [web, email, mobile]
How can I split out the items in each cell so that I can one-hot encode them?
I tried:
portfolio.channels.str.split(expand=True)
which returns:
0
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
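The all-NaN column is most likely because each cell holds a Python list rather than a string, so the `.str` methods have nothing to split. A minimal sketch of that diagnosis, assuming a hypothetical two-row `portfolio` frame shaped like the one above:

```python
import pandas as pd

# Hypothetical two-row frame mirroring the question's data.
portfolio = pd.DataFrame({
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# Each cell is a Python list, not a string.
cell_types = portfolio["channels"].map(type).unique()
print(cell_types)

# .str.split only operates on strings; non-string cells come back as NaN.
split_result = portfolio["channels"].str.split(expand=True)
print(split_result)
```

If the cells were strings such as "web, email", `str.split` would work; with real lists, the column has to be exploded or binarized instead.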
You can explode the column and then one-hot encode it. Here is an example using the first two rows:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
print(df)
channels
0 [email, mobile, social]
1 [web, email, mobile, social]
# explode the column of lists into one row per item
df_exploded = df.channels.explode()
# input data to encode; it must be 2D, hence the reshape
X = df_exploded.to_numpy()[:, None]
# fit the one-hot encoder
oh = OneHotEncoder()
oh.fit(X)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(oh.transform(X).toarray(), columns=oh.get_feature_names_out())
x0_email x0_mobile x0_social x0_web
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0
4 1.0 0.0 0.0 0.0
5 0.0 1.0 0.0 0.0
6 0.0 0.0 1.0 0.0
From here, we can get one encoded row per original row by grouping on the index of the exploded series and summing:
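# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_encoded = pd.DataFrame(oh.transform(X).toarray(),
                          columns=oh.get_feature_names_out())
# rows that came from the same original row share an index label
df_encoded.groupby(df_exploded.index).sum()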
x0_email x0_mobile x0_social x0_web
0 1.0 1.0 1.0 0.0
1 1.0 1.0 1.0 1.0
You can use MultiLabelBinarizer from sklearn.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# create the MultiLabelBinarizer and fit_transform your data (only the first 3 rows here)
mlb = MultiLabelBinarizer()
a = mlb.fit_transform(df.channels.to_numpy())
# create the dataframe, with column names taken from mlb.classes_
df_ohe = pd.DataFrame(a, index=df.index, columns=mlb.classes_)
print(df_ohe)
email mobile social web
0 1 1 1 0
1 1 1 1 1
2 1 1 0 1
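To keep the indicator columns next to the original list column, the encoded frame can be joined back on the index. A minimal self-contained sketch, assuming a three-row sample frame; `encoded` and `df_with_flags` are illustrative names, not from the answer:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed sample data shaped like the question's column.
df = pd.DataFrame({
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email", "mobile", "social"],
        ["web", "email"],
    ],
})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["channels"]),
    index=df.index,
    columns=mlb.classes_,
)

# Attach the indicator columns next to the original list column.
df_with_flags = df.join(encoded)
print(df_with_flags)
```

Passing `index=df.index` keeps the join aligned even when the original frame does not use a default 0..n-1 index.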
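A solution using pandas alone (sum(level=0) was removed in pandas 2.0, so group by the index level instead):
df['channels'].explode().str.get_dummies().groupby(level=0).sum()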
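The pandas-only approach can be sketched end to end as below, assuming a three-row sample frame. One deliberate change from the one-liner: this sketch aggregates with `max()` rather than `sum()`, so the result stays strictly 0/1 even if a list happens to repeat an item. With no repeats, the two give the same result.

```python
import pandas as pd

# Assumed sample data shaped like the question's column.
df = pd.DataFrame({
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email", "mobile", "social"],
        ["web", "email"],
    ],
})

# One row per list item; the original row label is repeated in the index.
exploded = df["channels"].explode()

# One indicator column per distinct item.
dummies = exploded.str.get_dummies()

# Collapse back to one row per original row.
one_hot = dummies.groupby(level=0).max()
print(one_hot)
```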