gla*_*313 1 python classification feature-selection scikit-learn
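I am using scikit-learn's random forest (ensemble.RandomForestClassifier) in Python for classification, and feature_importances_ to find the features that matter most to the classifier. My code so far is: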
from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# database, labels_raw, user_ids_raw and LeavePUsersOut are not shown in the question
venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
    # Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)
orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# DictVectorizer() and TfidfTransformer() build the feature matrix; each instance has 580 dimensions, i.e. there are 580 venue types

data = orig_ven_feat.tocsr()
le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    # transform() returns a length-1 array, so take its first element
    unlabelled_int = int(le.transform(["Unlabelled"])[0])
else:
    unlabelled_int = -1

valid_rows_idx = np.where(labels != unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross-validation, labels is for classification

clf = ensemble.RandomForestClassifier(n_estimators=50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds=10)
data = data[valid_rows_idx, :].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind, :]
    test_data = data[test_ind, :]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]

    print("Training classifier...")
    clf.fit(train_data, labels_train)
    importances = clf.feature_importances_
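The problem is that feature_importances_ gives me an array of dimension 580 (the same as the feature dimension), and I want to know the top 20 most important features (the 20 most important venues).
I think the least I need is the indices of the 20 largest values in importances, but I don't know:
how to get the indices of the top 20 values from importances
how to match those indices to the real venue names ('school', 'home', ....), given that I used DictVectorizer and TfidfTransformer
Any good ideas? Thank you very much!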
Jar*_*ber 11
To get the importance of each feature by name, iterate over the column names and feature_importances_ together (they map to each other one to one):
# df is assumed to be the pandas DataFrame whose columns were used to train clf
for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
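The feature_importances_ attribute holds the relative importance values in the same order in which the features were fed to the algorithm. So, to get the top 20 features, you need to sort the features by importance and keep the 20 largest, for example: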
import numpy

# forest is the fitted RandomForestClassifier (clf in the question)
importances = forest.feature_importances_
indices = numpy.argsort(importances)[-20:]
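([-20:] because argsort sorts in ascending order, so the 20 most important features are the last 20 elements of the array. Append [::-1] if you want them ordered from most to least important.)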