Nel*_*son 6 python scikit-learn
scikit-learn's PolynomialFeatures makes it easy to generate polynomial features.
Here is a simple example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Example data:
X = np.arange(6).reshape(3, 2)
# Works fine
poly = PolynomialFeatures(2)
pd.DataFrame(poly.fit_transform(X))
   0  1  2   3   4   5
0  1  0  1   0   0   1
1  1  2  3   4   6   9
2  1  4  5  16  20  25
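To read the columns of that output: for two input features a and b, a degree-2 expansion yields 1, a, b, a², ab, b². The following is a minimal pure-Python sketch of that expansion. The helper name `poly_expand` is hypothetical (it is not part of scikit-learn) and is only meant to reproduce the column order shown above.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_expand(row, degree=2):
    """Return every monomial of the values in `row` up to `degree`, bias term first."""
    terms = []
    for d in range(degree + 1):
        # Each combination of feature indices (with repetition) is one monomial;
        # the empty combination (d == 0) gives the constant term 1.
        for combo in combinations_with_replacement(range(len(row)), d):
            terms.append(prod(row[i] for i in combo))
    return terms


# Same values as np.arange(6).reshape(3, 2)
rows = [[0, 1], [2, 3], [4, 5]]
expanded = [poly_expand(r) for r in rows]
for r in expanded:
    print(r)
```

Each printed row should correspond to one row of the DataFrame above.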
Question: is it possible to apply the polynomial transformation only to a specified list of features?
For example:
# Build a DataFrame from the example array so a string column can be added
X2 = pd.DataFrame(X)
# Categorical feature will be handled
# by a one hot encoder in another feature generation step
X2['animal'] = ['dog', 'dog', 'cat']
# Don't try to poly transform the animal column
poly2 = PolynomialFeatures(2, cols=[0, 1]) # <-- ("cols" not an actual param)
# desired outcome:
pd.DataFrame(poly2.fit_transform(X2))
   0  1  2   3   4   5  'animal'
0  1  0  1   0   0   1     'dog'
1  1  2  3   4   6   9     'dog'
2  1  4  5  16  20  25     'cat'
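This would be especially useful when using Pipeline to combine a long series of feature generation and model training steps.
One option is to roll your own transformer (Michelle Fullwood has a great example), but I figured someone else must have stumbled on this use case before.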
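PolynomialFeatures, like many other transformers in sklearn, has no parameter that specifies which column(s) of the data to apply it to, so you cannot simply drop it into a pipeline and expect it to work.
A more general approach is to use FeatureUnion and give each feature in your dataframe its own transformer through a nested pipeline.
A simple example could be: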
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = pd.DataFrame({'cat_var': ['a', 'b', 'c'], 'num_var': [1, 2, 3]})

class ColumnExtractor(object):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_cols = X[self.columns]
        return X_cols

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('num_var', Pipeline([
            ('extract', ColumnExtractor(columns=['num_var'])),
            ('poly', PolynomialFeatures(degree=2))
        ])),
        ('cat_var', Pipeline([
            ('extract', ColumnExtractor(columns=['cat_var'])),
            ('le', LabelEncoder()),
            ('ohe', OneHotEncoder()),
        ]))
    ])),
    ('estimator', LogisticRegression())
])
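As a different route to the same goal (this is not the FeatureUnion approach above), more recent scikit-learn releases ship `sklearn.compose.ColumnTransformer`, which routes named columns to separate transformers directly. The sketch below assumes scikit-learn 0.20 or later and reuses the toy DataFrame from this answer; the step names `'poly'` and `'ohe'` are arbitrary labels chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

X = pd.DataFrame({'cat_var': ['a', 'b', 'c'], 'num_var': [1, 2, 3]})

# Each entry is (name, transformer, columns). The outputs are concatenated
# side by side in the order the transformers are listed.
features = ColumnTransformer(
    [
        ('poly', PolynomialFeatures(degree=2), ['num_var']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['cat_var']),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = features.fit_transform(X)
print(Xt.shape)
print(Xt)
```

The `features` object can then be used as the first step of a Pipeline, followed by an estimator, in the same way as the FeatureUnion above.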
Yes, there is: check out sklearn-pandas.
This should work (there is probably a more elegant solution, but I can't test it right now):
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper
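X2.columns = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'animal']
# A list selector such as ['col0'] passes a 2-D single-column array,
# which is the input shape PolynomialFeatures expects.
mapper = DataFrameMapper([
    (['col0'], PolynomialFeatures(2)),
    (['col1'], PolynomialFeatures(2)),
    (['col2'], PolynomialFeatures(2)),
    (['col3'], PolynomialFeatures(2)),
    (['col4'], PolynomialFeatures(2)),
    (['col5'], PolynomialFeatures(2)),
    ('animal', None)])  # None leaves the column untransformed; name must match X2.columns
X3 = mapper.fit_transform(X2)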
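In response to the answer from Peng Jun Huang: the approach is terrific, but the implementation has issues. (This should be a comment, but it is a bit long for that. Also, I don't have enough cookies for that.)
I tried to use the code and ran into some problems. After fooling around a bit, I arrived at the following answer to the original question. The main issue is that ColumnExtractor needs to inherit from BaseEstimator and TransformerMixin to become an estimator that can be used with the other sklearn tools.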
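To make concrete what the two base classes contribute, here is a small sketch comparing a plain class with one that uses the mixins. The class names `PlainExtractor` and `MixinExtractor` are hypothetical and exist only for this comparison; the mixin order follows the scikit-learn developer guide's recommendation of listing mixins before BaseEstimator.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class PlainExtractor(object):
    """Same methods as the original ColumnExtractor, no sklearn base classes."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class MixinExtractor(TransformerMixin, BaseEstimator):
    """Identical logic, but inheriting the sklearn helpers."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# TransformerMixin contributes fit_transform(); BaseEstimator contributes
# get_params()/set_params(), which nested set_params calls and cloning use.
for cls in (PlainExtractor, MixinExtractor):
    print(cls.__name__,
          hasattr(cls, 'fit_transform'),
          hasattr(cls, 'get_params'),
          hasattr(cls, 'set_params'))

params = MixinExtractor(columns=['n1']).get_params()
print(params)
```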
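My example data has two numerical variables and one categorical variable. I used pd.get_dummies for the one-hot encoding to keep the pipeline a bit simpler. I also left out the last stage of the pipeline (the estimator) because there is no y data to fit. The main point is to show selecting columns, processing them separately, and joining the results.
Enjoy.
M.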
import pandas as pd
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
X = pd.DataFrame({'cat': ['a', 'b', 'c'], 'n1': [1, 2, 3], 'n2':[5, 7, 9] })
  cat  n1  n2
0   a   1   5
1   b   2   7
2   c   3   9
# original version had class ColumnExtractor(object)
# estimators need to inherit from these classes to play nicely with others
class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_cols = X[self.columns]
        return X_cols
# Using pandas get dummies to make pipeline a bit simpler by
# avoiding one-hot and label encoder.
# Build the pipeline from a FeatureUnion that processes
# numerical and one-hot encoded separately.
# FeatureUnion puts them back together when it's done.
pipe2nvars = Pipeline([
    ('features', FeatureUnion([
        ('num', Pipeline([
            ('extract', ColumnExtractor(columns=['n1', 'n2'])),
            ('poly', PolynomialFeatures()),
        ])),
        ('cat_var', ColumnExtractor(columns=['cat_b', 'cat_c'])),
    ])),
])
# now show it working...
for p in range(1, 4):
    pipe2nvars.set_params(features__num__poly__degree=p)
    res = pipe2nvars.fit_transform(pd.get_dummies(X, drop_first=True))
    print('polynomial degree: {}; shape: {}'.format(p, res.shape))
    print(res)
polynomial degree: 1; shape: (3, 5)
[[1. 1. 5. 0. 0.]
[1. 2. 7. 1. 0.]
[1. 3. 9. 0. 1.]]
polynomial degree: 2; shape: (3, 8)
[[ 1. 1. 5. 1. 5. 25. 0. 0.]
[ 1. 2. 7. 4. 14. 49. 1. 0.]
[ 1. 3. 9. 9. 27. 81. 0. 1.]]
polynomial degree: 3; shape: (3, 12)
[[ 1. 1. 5. 1. 5. 25. 1. 5. 25. 125. 0. 0.]
[ 1. 2. 7. 4. 14. 49. 8. 28. 98. 343. 1. 0.]
[ 1. 3. 9. 9. 27. 81. 27. 81. 243. 729. 0. 1.]]