How do I split the items in each cell of a column and one-hot encode them?

Cec*_*lia 1 python pandas scikit-learn

I have a df in which one column looks like this:

              channels  
0       [email, mobile, social]         
1  [web, email, mobile, social]         
2          [web, email, mobile]             
3          [web, email, mobile]             
4                  [web, email]            
5  [web, email, mobile, social]      
6  [web, email, mobile, social]         
7       [email, mobile, social]        
8  [web, email, mobile, social]            
9          [web, email, mobile]  

How can I split out the items in each cell so that I can one-hot encode them?

I tried:

portfolio.channels.str.split(expand=True)

This returns:
      0
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN

yat*_*atu 5

You can explode the column and then one-hot encode it. Here is an example using the first two rows:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

print(df)
                       channels
0       [email, mobile, social]
1  [web, email, mobile, social]

# explode the column of lists into one item per row (the index is repeated)
df_exploded = df.channels.explode()
# the encoder expects 2D input, hence the added axis
X = df_exploded.to_numpy()[:, None]
# fit the one-hot encoder
oh = OneHotEncoder()
oh.fit(X)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(oh.transform(X).toarray(), columns=oh.get_feature_names_out())

   x0_email  x0_mobile  x0_social  x0_web
0       1.0        0.0        0.0     0.0
1       0.0        1.0        0.0     0.0
2       0.0        0.0        1.0     0.0
3       0.0        0.0        0.0     1.0
4       1.0        0.0        0.0     0.0
5       0.0        1.0        0.0     0.0
6       0.0        0.0        1.0     0.0

From here, we can group by the index of the exploded series and sum, which gives one row per original record:

df_encoded = pd.DataFrame(oh.transform(X).toarray(),
                          columns=oh.get_feature_names_out())
df_encoded.groupby(df_exploded.index).sum()

   x0_email  x0_mobile  x0_social  x0_web
0       1.0        1.0        1.0     0.0
1       1.0        1.0        1.0     1.0
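
For reference, the explode, encode and group-by steps above can be put together into one self-contained sketch. The two-row frame is made-up sample data mirroring the question, and two details are my own choices rather than part of the answer: column names are taken from `categories_` so they carry no `x0_` prefix, and `max()` replaces `sum()` so the result stays 0/1 even if a label is repeated within a list.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data mirroring the first two rows of the question
df = pd.DataFrame({"channels": [["email", "mobile", "social"],
                                ["web", "email", "mobile", "social"]]})

# One label per row; the original index is repeated for each label
exploded = df["channels"].explode()
# The encoder expects a 2D array of shape (n_samples, n_features)
X = exploded.to_numpy().reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X).toarray()  # sparse result -> dense ndarray

# categories_[0] holds the sorted labels learned for the single input column
one_hot = pd.DataFrame(encoded, index=exploded.index,
                       columns=encoder.categories_[0])

# Collapse back to one row per original record
result = one_hot.groupby(level=0).max().astype(int)
print(result)
```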


Ben*_*n.T 5

You can use MultiLabelBinarizer from sklearn.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# create the MultiLabelBinarizer and fit_transform the data (only the first 3 rows here)
mlb = MultiLabelBinarizer()
a = mlb.fit_transform(df.channels.to_numpy())

# create the dataframe, with the column names being the classes
df_ohe = pd.DataFrame(a, index=df.index, columns=mlb.classes_)

print(df_ohe)
   email  mobile  social  web
0      1       1       1    0
1      1       1       1    1
2      1       1       0    1
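
If the frame has other columns you want to keep, the indicator columns can be joined back in place of the list column. This is only a sketch: the `offer_id` column and the three sample rows are invented for illustration and do not come from the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame: `offer_id` is an illustrative extra column
df = pd.DataFrame({
    "offer_id": ["a", "b", "c"],
    "channels": [["email", "mobile", "social"],
                 ["web", "email", "mobile", "social"],
                 ["web", "email"]],
})

mlb = MultiLabelBinarizer()
# fit_transform takes an iterable of label collections and returns a 0/1 array
indicators = pd.DataFrame(mlb.fit_transform(df["channels"]),
                          index=df.index, columns=mlb.classes_)

# Replace the list column with its indicator columns
df_encoded = df.drop(columns="channels").join(indicators)
print(df_encoded)
```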


WeN*_*Ben 5

A solution using only pandas:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df['channels'].explode().str.get_dummies().groupby(level=0).sum()
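
Here is a minimal runnable sketch of the pandas-only route on made-up sample rows. It also shows a variant that is not part of this answer: joining each list into a single string and letting `str.get_dummies` split it. The variant assumes the `"|"` separator never occurs inside a label.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's column
df = pd.DataFrame({"channels": [["email", "mobile", "social"],
                                ["web", "email", "mobile", "social"],
                                ["web", "email"]]})

# explode -> one label per row; get_dummies -> indicator columns;
# groupby(level=0).sum() -> one row per original index
by_explode = df["channels"].explode().str.get_dummies().groupby(level=0).sum()

# Variant: join each list into a "|"-separated string and let
# str.get_dummies split on that separator
by_join = df["channels"].str.join("|").str.get_dummies(sep="|")

print(by_explode)
print(by_join)
```

For lists without repeated labels, both routes should give the same indicator table. The explode route counts repeats, while the join route only marks presence.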

  • @yatu I posted after Ben.T, so I switched mine to pandas explode~ (2 upvotes)