How to one-hot encode variable-length features?

Question

Zel*_*ong 8 python numpy pandas scikit-learn

Given a list of variable-length features:

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

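where each sample has a variable number of features, and each feature is a str that is already categorical (its presence means 1, its absence means 0).

To use sklearn's feature selection utilities, I have to convert features into a 2D array that looks like this: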

    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0

How can I achieve this with sklearn or numpy?

Answer 1

Viv*_*mar 11

You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this.

Code for your example:

from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

# One row per sample, one column per distinct feature name
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

Output:

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
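
To get the labelled table from the question rather than a bare array, the column names can be read from the fitted binarizer's classes_ attribute. A minimal sketch, assuming pandas is available; the row labels s1..s3 are made up here to match the question's table:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# classes_ holds the sorted distinct feature names, one per output column
sample_ids = [f"s{i + 1}" for i in range(len(features))]  # hypothetical row labels
df = pd.DataFrame(encoded, columns=mlb.classes_, index=sample_ids)
print(df)
```

Columns follow the sorted order of the feature names, so the layout matches the table in the question for this input.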

This can also be used in a pipeline, together with the other feature_selection utilities.
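
One caveat when wiring this up: MultiLabelBinarizer.fit_transform takes a single argument, while Pipeline calls each step with both X and y, so using it directly as a step can raise a TypeError in the scikit-learn versions I have seen. A rough sketch of a workaround, where MultiHotEncoder is a hypothetical wrapper (not part of scikit-learn) and VarianceThreshold is just an example selector:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper exposing MultiLabelBinarizer with a fit(X, y) signature."""

    def fit(self, X, y=None):
        # y is accepted only to satisfy the Pipeline calling convention
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

pipe = Pipeline([
    ("encode", MultiHotEncoder()),
    ("select", VarianceThreshold()),  # drops columns that are constant across samples
])
selected = pipe.fit_transform(features)

# Boolean mask over the encoded columns showing which ones were kept
kept = pipe.named_steps["select"].get_support()
print(selected.shape, kept)
```

With this input, f2 appears in every sample, so it has zero variance and is the one column the selector drops.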
