Multi-column one-hot encoding and column naming in sklearn

Question

Gid*_*per 4 python python-3.x pandas scikit-learn one-hot-encoding

I have the following code, which one-hot encodes two of my columns.

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

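What I would like to know is how to do this in 4 lines while still naming the columns correctly in the output. That is, I can create a correct one-hot encoded array by passing both column names to fit_transform, but when I try to name the columns of the resulting DataFrame, it tells me there is a shape mismatch between the indices: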

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For context, phone and city each have 3 values.

    city   phone
0   CityA  iPhone
1   CityB  Android
2   CityB  iPhone
3   CityA  iPhone
4   CityC  Android

Answer 1

Max*_*Kan 12

You're almost there... As you said, you can pass all the columns you want to encode directly to fit_transform.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_
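
To see why `ohe.categories_` cannot be passed to `columns=` as it stands, it may help to compare shapes. The sketch below rebuilds the five sample rows from the question (the `df` construction is an assumption for illustration) and checks that `categories_` holds one array per input column, while the encoded array has one column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows copied from the question's table (illustrative only)
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[["phone", "city"]]).toarray()

# The encoded array has one column per category, across all input columns
n_output_columns = feature_arr.shape[1]

# categories_ is a list holding one array per *input* column
n_label_groups = len(ohe.categories_)
labels_per_group = [len(c) for c in ohe.categories_]

print(n_output_columns, n_label_groups, labels_per_group)
```

The number of label groups equals the number of input columns, not the number of output columns, which is the mismatch pandas reports.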

Then you just need to do the following:

# Flattening this way assumes every encoded column has the same number of categories
feature_labels = np.array(feature_labels).ravel()

This lets you name the columns as desired:

features = pd.DataFrame(feature_arr, columns=feature_labels)
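
Put together, the steps above can be sketched end to end. This version swaps `np.array(...).ravel()` for `np.concatenate`, which is a substitution and not the answer's original call; it is used here because it also flattens the labels when the columns have different numbers of categories. The `df` construction is again an assumption based on the question's sample table.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows copied from the question's table (illustrative only)
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[["phone", "city"]]).toarray()

# Join the per-column category arrays into one flat array of labels.
# Unlike np.array(...).ravel(), this does not need equal-length arrays.
feature_labels = np.concatenate(ohe.categories_)

# Reuse the original index so the encoded rows line up with df
features = pd.DataFrame(feature_arr, columns=feature_labels, index=df.index)
print(features)
```

Note that the labels carry no column prefix, so two input columns that share a category value would produce duplicate column names.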

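@VitorGonçalves This happens because the dataset returned by "fit_transform" has 7 columns after the transformation, so pandas expects the "feature_labels" array to contain 7 matching labels, but it only has 2 elements. To fix this error, replace "feature_labels = ohe.categories_" with "feature_labels = ohe.get_feature_names()" (3 upvotes)
@MaximeKan I am having trouble creating a DataFrame with multiple features. It returns the error "Shape of passed values is (10692, 7), indices imply (10692, 2)", and I have to add the labels manually. How can I fix this? (2 upvotes)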
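
The suggestion in the comments can be sketched as follows. In scikit-learn 1.0 and later the method is named `get_feature_names_out`, and `get_feature_names` was removed in 1.2, so this sketch tries the newer name first and falls back to the older one. The `df` construction is an assumption based on the question's sample table.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows copied from the question's table (illustrative only)
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

cols = ["phone", "city"]
ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[cols]).toarray()

# Ask the encoder for one label per output column, prefixed by input column
if hasattr(ohe, "get_feature_names_out"):
    feature_labels = ohe.get_feature_names_out(cols)  # scikit-learn >= 1.0
else:
    feature_labels = ohe.get_feature_names(cols)      # older releases

features = pd.DataFrame(feature_arr, columns=feature_labels, index=df.index)
print(features.columns.tolist())
```

Because each label is prefixed with its source column, this works when the columns have different numbers of categories and avoids name clashes between columns.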

Archived:	6 years, 10 months ago
Views:	17197
Last recorded:	4 years, 9 months ago