Zak*_*yke 12 python feature-selection pandas scikit-learn output
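After running Scikit-Learn's VarianceThreshold on a set of data, it removes a couple of features. I feel like I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code: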
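import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    # transform() returns a NumPy array, so the column names are lost here
    selector = (pd.DataFrame(selector.transform(data)))
    return selector

x = VarianceThreshold_selector(data)
print(x)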
changes the following data (this is just a small subset of the rows):
Survived Pclass Sex Age SibSp Parch Nonsense
0 3 1 22 1 0 0
1 1 2 38 1 0 0
1 3 2 26 0 0 0
into this (again, just a small subset of the rows):
0 1 2 3
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Using the get_support method, I know that these are Pclass, Age, Sibsp, and Parch, so I would rather this return something more like:
Pclass Age Sibsp Parch
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Is there an easy way to do this? I'm very new to Scikit Learn, so I'm probably just doing something silly.
Jar*_*rad 19
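Would something like this help? If you pass it a pandas DataFrame, it takes the columns and uses get_support, as you mentioned, to index into the list of columns, pulling out only the column headers that met the variance threshold.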
>>> df
Survived Pclass Sex Age SibSp Parch Nonsense
0 0 3 1 22 1 0 0
1 1 1 2 38 1 0 0
2 1 3 2 26 0 0 0
>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]
...
>>> variance_threshold_selector(df, 0.5)
Pclass Age
0 3 22
1 1 38
2 3 26
>>> variance_threshold_selector(df, 0.9)
Age
0 22
1 38
2 26
>>> variance_threshold_selector(df, 0.1)
Survived Pclass Sex Age SibSp
0 0 3 1 22 1
1 1 1 2 38 1
2 1 3 2 26 0
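To see why each threshold keeps the columns it does, it can help to look at the variances the selector computed. The following is a minimal sketch, assuming the three sample rows above are the whole dataset; the names `variances`, `mask`, and `kept` are only illustrative. It reads the fitted `variances_` attribute and the boolean mask that `get_support()` returns when `indices` is left at its default.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# The three sample rows from the question (assumed to be the full dataset here).
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [1, 2, 2],
    "Age": [22, 38, 26],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

selector = VarianceThreshold(threshold=0.5).fit(df)

# Per-column variance as computed by the selector, labelled with the column names.
variances = pd.Series(selector.variances_, index=df.columns)

# Boolean mask: True for columns whose variance is above the threshold.
mask = selector.get_support()
kept = df.columns[mask].tolist()

print(variances.round(3))
print(kept)  # ['Pclass', 'Age'] for these three rows
```

On these rows Pclass has a variance of 8/9, roughly 0.889, which is why it survives a threshold of 0.5 but not 0.9. On the full dataset the variances will differ, so more columns may be kept.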
There may be a better way to do this, but for those interested, here's how I did it:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    # Select model
    selector = VarianceThreshold(0)  # Defaults to 0.0, i.e. only remove features with the same value in all samples
    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # Integer positions of the retained features
    features = data.columns[features]  # Names of the retained features
    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
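I came here looking for a way to get transform() or fit_transform() to return a DataFrame, but I suspect it isn't supported.
However, you can subset the data a bit more cleanly like this: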
data_transformed = data.loc[:, selector.get_support()]