标签: feature-selection

如何确定RandomForestClassifier中的feature_importances？

我有一个分类任务,时间序列作为数据输入,其中每个属性(n = 23)代表一个特定的时间点.除了绝对分类结果我想知道,哪些属性/日期对结果的贡献程度如何.所以我只是使用它feature_importances_,这对我很有用.

但是,我想知道如何计算它们以及使用哪种度量/算法.很遗憾,我找不到有关此主题的任何文档.

feature-selection random-forest scikit-learn

use*_*670

2015 10-23

114
推荐指数

4
解决办法

6万
查看次数

在Scikit Learn中运行SelectKBest后获取功能名称的最简单方法

我想进行有监督的学习.

到现在为止,我知道要对所有功能进行监督学习.

但是,我还想进行K最佳功能的实验.

我阅读了文档,发现在Scikit中学习了SelectKBest方法.

不幸的是,我不确定在找到这些最佳功能后如何创建新的数据帧:

我们假设我想进行5个最佳功能的实验:

from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)

Run Code Online (Sandbox Code Playgroud)

现在,如果我要添加下一行:

dataframe = pd.DataFrame(select_k_best_classifier)

Run Code Online (Sandbox Code Playgroud)

我将收到一个没有功能名称的新数据帧(只有索引从0到4开始).

我应该把它替换为:

dataframe = pd.DataFrame(fit_transofrmed_features, columns=features_names)

Run Code Online (Sandbox Code Playgroud)

我的问题是如何创建features_names列表？

我知道我应该使用:select_k_best_classifier.get_support()

返回布尔值数组.

数组中的真值表示右列中的索引.

我应该如何使用这个布尔数组与我可以通过该方法获得的所有功能名称的数组:

feature_names = list(features_dataframe.columns.values)

Run Code Online (Sandbox Code Playgroud)

python feature-selection pandas scikit-learn

Avi*_*ade

2017 07-30

38
推荐指数

4
解决办法

3万
查看次数

使用字符串/分类特征(变量)进行线性回归分析？

回归算法似乎正在处理以数字表示的特征.例如:

此数据集不包含分类要素/变量.很清楚如何对这些数据进行回归并预测价格.

但现在我想对包含分类特征的数据进行回归分析:

有5个特点:District,Condition,Material,Security,Type

如何对这些数据进行回归？我是否必须手动将所有这些字符串/分类数据转换为数字？我的意思是,如果我必须创建一些编码规则,并根据该规则将所有数据转换为数值.有没有简单的方法将字符串数据转换为数字而无需手动创建自己的编码规则？可能有一些Python中的库可用于此吗？由于"编码错误",回归模型是否存在某些风险？

python regression machine-learning linear-regression feature-selection

Erb*_*yev

2017 11-19

36
推荐指数

4
解决办法

4万
查看次数

TypeError:只有具有一个元素的整数数组才能转换为索引

使用交叉验证执行递归功能选择时出现以下错误:

Traceback (most recent call last):
  File "/Users/.../srl/main.py", line 32, in <module>
    argident_sys.train_classifier()
  File "/Users/.../srl/identification.py", line 194, in train_classifier
    feat_selector.fit(train_argcands_feats,train_argcands_target)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
    ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index

Run Code Online (Sandbox Code Playgroud)

生成错误的代码如下:

def train_classifier(self):

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates
    train_argcands_feats = []
    train_argcands_target = []

    for argcand in argcands:
        train_argcands_feats.append(self.extract_features(argcand))
        if argcand["info"]["label"] == "NULL":
            train_argcands_target.append("NULL")
        else: …

Run Code Online (Sandbox Code Playgroud)

python feature-selection scikit-learn

fer*_*vam

2012 09-16

31
推荐指数

2
解决办法

3万
查看次数

理解sklearn中CountVectorizer中的`ngram_range`参数

我对如何在Python中的scikit-learn库中使用ngrams感到有点困惑,具体来说,这个ngram_range参数在CountVectorizer中是如何工作的.

运行此代码:

from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print cv.vocabulary_

Run Code Online (Sandbox Code Playgroud)

给我:

{'hi ': 0, 'bye': 1, 'run away': 2}

Run Code Online (Sandbox Code Playgroud)

在我明显错误的印象中,我会得到unigrams和bigrams,就像这样:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

Run Code Online (Sandbox Code Playgroud)

我正在使用这里的文档:http: //scikit-learn.org/stable/modules/feature_extraction.html

显然,我对如何使用ngrams的理解存在严重错误.也许这个论点没有效果,或者我对一个真正的二元组有一些概念上的问题!我很难过.如果有人提出建议,我会感激不尽.

更新:
我意识到了我的方式的愚蠢.我的印象是ngram_range会影响词汇,而不是语料库.

python n-gram feature-selection scikit-learn

Mat*_*ien

2014 06-04

27
推荐指数

1
解决办法

3万
查看次数

如何使用scikit-learn PCA减少功能并知道哪些功能被丢弃

我试图在维度为mxn的矩阵上运行PCA,其中m是要素的数量,n是样本的数量.

假设我想保留nf具有最大方差的要素.随着scikit-learn我能够做到这样:

from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)

X_new = pca.transform(X)

Run Code Online (Sandbox Code Playgroud)

现在,我得到一个X_new形状为nx nf 的新矩阵.是否可以知道哪些功能已被丢弃或保留哪些功能？

谢谢

python machine-learning pca feature-selection scikit-learn

gc5*_*gc5

2014 04-25

25
推荐指数

1
解决办法

3万
查看次数

如何计算XGBoost包中的特征分数(/重要性)？

该命令xgb.importance返回由f分数测量的特征重要性图.

这个f分数代表什么,如何计算？

输出: 特征重要性图

python r classification feature-selection xgboost

ish*_*ido

2017 04-26

25
推荐指数

1
解决办法

1万
查看次数

使用scikit-learn进行特征选择

我是机器学习的新手.我正在使用Scikit Learn SVM准备我的数据进行分类.为了选择最好的功能,我使用了以下方法:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Run Code Online (Sandbox Code Playgroud)

由于我的数据集由负值组成,因此出现以下错误:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y,     **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)

--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-  packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X) …

Run Code Online (Sandbox Code Playgroud)

python machine-learning chi-squared feature-selection scikit-learn

sar*_*ara

2018 03-22

20
推荐指数

1
解决办法

2万
查看次数