MultiLabelBinarizer: 'float' object is not iterable

Kri*_*shh 3 python machine-learning pandas scikit-learn

I have the following DataFrame:

df[['row_num','set_id']].head()

row_num     path_id_set
988681      [31672, 0]
988680      [31965, 0]
988679      [0, 78464]

I am trying to use MultiLabelBinarizer, but it fails with the error 'float' object is not iterable:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(df['set_id'].str.split(','))

TypeError: 'float' object is not iterable

jez*_*ael 5

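I think the problem is missing values: NaN is a float, and MultiLabelBinarizer needs an iterable of labels for each row. You can filter out the missing rows first: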

print (df)
   row_num     set_id
0   988681        NaN
1   988680  [31965,0]
2   988679  [0,78464]

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# boolean mask selecting the rows whose set_id is not NaN
mask = df['set_id'].notnull()

# keep only those rows, strip the brackets and split each string into a list of labels
arr = mlb.fit_transform(df.loc[mask, 'set_id'].str.strip('[]').str.split(','))

# build a DataFrame from the result, then add the missing (NaN) rows back as all-zero rows
df = (pd.DataFrame(arr, index=df.index[mask], columns=mlb.classes_)
               .reindex(df.index, fill_value=0))

print (df)
   0  31965  78464
0  0      0      0
1  1      1      0
2  1      0      1
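An alternative that avoids the mask and reindex steps is to map each missing value to an empty list before binarizing. This is only a minimal sketch: it assumes set_id holds strings such as "[31965,0]" with NaN for missing rows, and the sample DataFrame below is hypothetical data mirroring the example above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data (assumption): strings plus one missing value.
df = pd.DataFrame({
    "row_num": [988681, 988680, 988679],
    "set_id": [np.nan, "[31965,0]", "[0,78464]"],
})

# NaN is a float, which is why iterating over it raises the TypeError.
assert isinstance(df.loc[0, "set_id"], float)

# Parse each string into a list of labels; map missing values to an empty list.
labels = [
    value.strip("[]").split(",") if isinstance(value, str) else []
    for value in df["set_id"]
]

mlb = MultiLabelBinarizer()
arr = mlb.fit_transform(labels)

# Rows with no labels come out as all zeros, so no reindex is needed.
out = pd.DataFrame(arr, index=df.index, columns=mlb.classes_)
print(out)
```

As in the code above, the labels are strings after splitting, so the resulting column names are '0', '31965' and '78464' rather than integers.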