Tag: feature-selection

from sklearn import (cross_validation, feature_selection, pipeline,
                     preprocessing, linear_model, grid_search)
folds = 5
split = cross_validation.StratifiedKFold(target, n_folds=folds, shuffle = False, random_state = 0)

scores = []
for k, (train, test) in enumerate(split):

    X_train, X_test, y_train, y_test = X.ix[train], X.ix[test], y.ix[train], y.ix[test]

    top_feat = feature_selection.SelectKBest()

    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                                 ('feat', top_feat),
                                 ('clf', linear_model.LogisticRegression())])

    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']

    param_grid = [{'feat__k': K,
                  'clf__C': C,
                  'clf__penalty': penalty}]

    scoring …


pipeline feature-selection scikit-learn

fig*_*ggy

2015 10-28

6
votes

3
answers

7472
views

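Python Spark: narrowing down the most relevant features using PCA

I am using Spark 2.2 with Python. I am using PCA from the ml.feature module, and a VectorAssembler to feed my features to PCA. To clarify, suppose I have a table with three columns, col1, col2 and col3; then I do: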

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)


At this point I have run PCA with 2 components, and I can look at its values:

m = model.pc.values.reshape(3, 2)


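This corresponds to 3 rows (the number of columns in my original table) and 2 columns (the number of components in my PCA). My question is whether the three rows here are in the same order as the input columns I specified for the vector assembler above. To clarify further, does the matrix above correspond to: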

          | PC1 | PC2 |
 ---------|-----|-----|
    col1  |     |     |
 ---------|-----|-----|
    col2  |     |     |
 ---------|-----|-----|
    col3  |     |     |
 ---------+-----+-----+


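Note that the example here is only for clarity. In my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find a definite answer in the Spark documentation. I want to do this to pick the best columns/features from my original table, based on the top principal components, to train my model. Or is there something else or better in Spark ML PCA that I should be looking at to infer such a result?

Or can I not use PCA for this, and have to use other techniques such as Spearman rank correlation?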
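One thing worth checking before reading the rows as col1, col2, col3: as far as I know, Spark's DenseMatrix keeps its values in column-major order, so a plain `values.reshape(3, 2)` (row-major by default in NumPy) may interleave the loadings, and `model.pc.toArray()` would be the safer call. The sketch below uses NumPy only, with made-up loading values, to show the difference between the two reshapes. It does not call Spark, and the numbers and the `loadings` dictionary are illustrative assumptions.

```python
import numpy as np

# Hypothetical loadings for 3 input columns and 2 components, written
# the way we want to read them (rows = input columns, columns = PCs).
expected = np.array([[0.1, 0.4],
                     [0.2, 0.5],
                     [0.3, 0.6]])

# A column-major flat buffer lists all of PC1 first, then all of PC2.
flat = expected.flatten(order="F")

# Default (row-major) reshape: the values land in the wrong cells.
row_major = flat.reshape(3, 2)

# Column-major reshape: recovers the intended matrix.
col_major = flat.reshape(3, 2, order="F")

# Label each row with the column name, in the order given to the assembler.
input_cols = ["col1", "col2", "col3"]
loadings = {name: col_major[i].tolist() for i, name in enumerate(input_cols)}
print(loadings)
```

If the rows do follow the assembler's input order (which is what the feature vector layout suggests), the labelled dictionary is what you would rank the original columns by.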

machine-learning pca feature-selection apache-spark pyspark

Sam*_*jan

2018 02-01

6
votes

1
answer

682
views

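Sentiment analysis pipeline: cannot get the correct feature names when using feature selection

In the example below I use a Twitter dataset to perform sentiment analysis. I use an sklearn pipeline to apply a series of transformations, add features, and add a classifier. The last step is to visualize the words with the highest predictive power. It works fine when I do not use feature selection. When I do use it, however, the results make no sense. I suspect that the order of the text features changes when feature selection is applied. Is there a way around this?

The code below has been updated to include the correct answer.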

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion

features = [c for c in df.columns.values if c not in ['target']]
target = 'target'

# train/test split, stratified on the target column
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, stratify=df[target], random_state=0)

#Create classes which allow to select specific columns from the dataframe

class NumberSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

class TextSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key …

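The usual cause of this symptom is that the classifier's coefficients line up with the features that survived selection, not with the full vocabulary. In scikit-learn a fitted selector exposes a boolean mask through `get_support()`, and the names have to be filtered with that mask before pairing them with coefficients. The sketch below shows only that filtering step with NumPy. The helper name `selected_feature_names`, the toy vocabulary, mask and coefficients are all made up for illustration, and it is not the truncated code from the question.

```python
import numpy as np

def selected_feature_names(feature_names, support_mask):
    """Keep only the names whose mask entry is True, preserving order."""
    names = np.asarray(feature_names)
    mask = np.asarray(support_mask, dtype=bool)
    if names.shape != mask.shape:
        raise ValueError("feature_names and support_mask must have the same length")
    return names[mask].tolist()

# Toy stand-ins: a vectorizer vocabulary and a selector mask over it.
vocab = ["bad", "good", "great", "meh", "awful"]
mask = [True, False, True, False, True]

kept = selected_feature_names(vocab, mask)

# Coefficients from the classifier have one entry per kept feature.
coefs = [-1.2, 0.9, -1.5]
ranked = sorted(zip(kept, coefs), key=lambda pair: abs(pair[1]), reverse=True)
print(ranked)
```

With a FeatureUnion, the same idea applies to the concatenated name list, built in the order the union stacks its transformers.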

python pipeline tf-idf feature-selection scikit-learn

Sta*_*kos

2019 07-11

6
votes

1
answer

184
views

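Get the row and column names of the largest entry in a pandas DataFrame (argmax)

df.idxmax() returns the index of the maximum along an axis (rows or columns), but I want an arg_max(df) over the whole DataFrame that returns a (row, column) tuple.

The use case I have in mind is feature selection, where I have a correlation matrix and want to "recursively" remove the features with the highest correlation. I preprocess the correlation matrix to take absolute values and set the diagonal elements to -1. I then propose using rec_drop, which recursively removes one feature from the pair with the highest correlation (subject to a cutoff, max_allowed_correlation) and returns the final list of features. For example: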

import numpy as np

def rec_drop(S, max_allowed_correlation=0.99):
    max_corr = S.max().max()
    if max_corr < max_allowed_correlation:  # base case for recursion
        return S.columns.tolist()
    # arg_max is the helper being asked for; row and col are distinct
    # features because the max can't be on the diagonal
    row, col = arg_max(S)
    S = S.drop(row).drop(row, axis=1)  # remove one of the two features from S
    return rec_drop(S, max_allowed_correlation)

S = S.abs()
np.fill_diagonal(S.values, -1)  # so that the max can't be on the diagonal
S = rec_drop(S, max_allowed_correlation=0.95)

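One possible shape for the arg_max helper, as a sketch rather than the accepted answer: take the position of the maximum in the flattened values with `np.argmax`, turn it back into a (row, column) position with `np.unravel_index`, and map those positions to labels. It assumes the frame contains no NaN values, and the small matrix S below is made up for the example.

```python
import numpy as np
import pandas as pd

def arg_max(df):
    """Return the (row label, column label) of the largest entry in df."""
    values = df.to_numpy()
    i, j = np.unravel_index(np.argmax(values), values.shape)
    return df.index[i], df.columns[j]

# Toy absolute-correlation matrix with the diagonal already set to -1.
S = pd.DataFrame(
    [[-1.0, 0.2, 0.97],
     [0.2, -1.0, 0.5],
     [0.97, 0.5, -1.0]],
    index=list("abc"),
    columns=list("abc"),
)

row, col = arg_max(S)
print(row, col, S.loc[row, col])
```

For a symmetric matrix the maximum appears twice, and this returns the first occurrence in row-major order. `S.stack().idxmax()` is a shorter pandas-only alternative that should give the same pair.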

python numpy feature-selection pandas

kus*_*ush

2014 12-03

5
votes

1
answer

4078
views

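How best to use zip codes in random forest model training?

I have a dataset with a zip code column. Zip codes carry some signal for the output, and I want to use them as a feature. I am using a random forest model.

I need advice on the best way to use the zip code column as a feature. (For example, should I get the latitude/longitude for each zip code rather than feeding in the zip code directly?)

Thanks in advance!!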
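One cheap option to try alongside latitude/longitude, sketched below with the standard library only: keep the zip code as a string, derive a coarser prefix to cut down the number of categories, and add a frequency encoding of that prefix. This assumes US-style 5-digit ZIP codes, where leading digits roughly track geography. The function name `zip_features`, the default `prefix_len=3` and the sample codes are illustrative choices, not something from the question.

```python
from collections import Counter

def zip_features(zip_codes, prefix_len=3):
    """Build simple categorical and frequency features from zip codes."""
    # Normalise to 5-character strings so codes like 2139 keep their leading zero.
    cleaned = [str(z).strip().zfill(5) for z in zip_codes]
    prefixes = [z[:prefix_len] for z in cleaned]
    counts = Counter(prefixes)
    total = len(cleaned)
    return [
        {"zip": z, "zip_prefix": p, "prefix_freq": counts[p] / total}
        for z, p in zip(cleaned, prefixes)
    ]

rows = zip_features(["94110", "94103", "10001", 2139])
for r in rows:
    print(r)
```

Whether the prefix, the frequency, or geocoded coordinates help most depends on the data, so it is worth comparing them by validation score rather than picking one up front.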

zipcode machine-learning feature-selection random-forest h2o

Mar*_*tel

2018 09-12

5
votes

1
answer

3466
views

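Why do the grpreg and gglasso libraries in R give different results for group LASSO?

I have been trying to use LASSO for unsupervised feature selection (by removing the class column). The dataset includes categorical (factor) and continuous (numeric) variables. The link is here. I built a design matrix using model.matrix(), which creates dummy variables for every level of the categorical variables.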

dataset <- read.xlsx("./hepatitis.data.xlsx", sheet = "hepatitis", na.strings = "")
names_df <- names(dataset)
formula_LASSO <- as.formula(paste("~ 0 +", paste(names_df, collapse = " + ")))
LASSO_df <- model.matrix(object = formula_LASSO, data = dataset, contrasts.arg = lapply(dataset[ ,sapply(dataset, is.factor)], contrasts, contrasts = FALSE ))

### Group LASSO using gglasso package
gglasso_group <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, …


r lasso-regression feature-selection unsupervised-learning

Meh*_*rim

2023 01-24

5
votes

1
answer

1125
views

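The correct way to use RFECV with permutation importance - Sklearn

Sklearn has a proposal to implement this in #15075, but in the meantime eli5 is suggested as a solution. However, I am not sure whether I am using it the right way. Here is my code: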

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import eli5
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
perm = eli5.sklearn.PermutationImportance(estimator,  scoring='r2', n_iter=10, random_state=42, cv=3)
selector = RFECV(perm, step=1, min_features_to_select=1, scoring='r2', cv=3)
selector = selector.fit(X, y)
selector.ranking_


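A few questions:

I am not sure whether I am using cross-validation correctly. Should PermutationImportance use cv to validate importances on a validation set, or should cross-validation be done only by RFECV? (In the example I used cv=3 in both cases, but I am not sure that is the right thing to do.)
If I run eli5.show_weights(perm), I get: AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'. Is this because I fit using RFECV? What I am doing is similar to the last snippet here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
As a less important issue, this gives me a warning when I set cv in eli5.sklearn.PermutationImportance: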