Kri*_*shh 3 python machine-learning pandas scikit-learn
I have the following DataFrame:
df[['row_num','set_id']].head()
row_num path_id_set
988681 [31672, 0]
988680 [31965, 0]
988679 [0, 78464]
I am trying to use MultiLabelBinarizer, but it fails with the error 'float' object is not iterable:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(df['set_id'].str.split(','))
TypeError: 'float' object is not iterable
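I think the problem is missing values: .str.split returns NaN (a float) for those rows, and MultiLabelBinarizer cannot iterate over a float. You can filter them out before fitting and add them back afterwards: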
print (df)
   row_num     set_id
0   988681        NaN
1   988680  [31965,0]
2   988679  [0,78464]
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# boolean mask that is True for rows whose set_id is not NaN
mask = df['set_id'].notnull()
# keep only the non-missing rows, strip the brackets and split into labels
arr = mlb.fit_transform(df.loc[mask, 'set_id'].str.strip('[]').str.split(','))
# build a DataFrame from the result, then reindex to restore the NaN rows as all-zero rows
df = (pd.DataFrame(arr, index=df.index[mask], columns=mlb.classes_)
        .reindex(df.index, fill_value=0))
print (df)
   0  31965  78464
0  0      0      0
1  1      1      0
2  1      0      1
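As a variation on the approach above, missing values can be turned into empty label sets before fitting, which removes the need for the mask and the reindex step. This is only a sketch, not the method from the answer: the helper `to_labels` is hypothetical, the sample frame is rebuilt by hand from the printed data, and it assumes `set_id` holds either string-encoded lists such as `"[31965,0]"`, real Python lists, or NaN.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data rebuilt from the frame printed above (assumed to be strings plus one NaN).
df = pd.DataFrame({
    "row_num": [988681, 988680, 988679],
    "set_id": [np.nan, "[31965,0]", "[0,78464]"],
})

def to_labels(value):
    """Hypothetical helper: normalise one cell into a list of int labels."""
    if isinstance(value, (list, tuple)):
        # Already a real sequence of labels.
        return [int(v) for v in value]
    if isinstance(value, str):
        # Parse a string such as "[31965,0]"; skip empty tokens (e.g. from "[]").
        return [int(tok) for tok in value.strip("[]").split(",") if tok.strip()]
    # Anything else (NaN, None) is treated as "no labels".
    return []

labels = df["set_id"].apply(to_labels)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(labels),
    index=df.index,
    columns=mlb.classes_,
)
print(encoded)
```

The row with the missing value ends up as all zeros, as in the result above. One difference is that the column labels here are integers (0, 31965, 78464) rather than strings, because the tokens are converted with `int`.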