LdM*_*LdM 4 python machine-learning feature-selection scikit-learn
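I would like to know whether, when I use a classifier like the following, the other features in my dataset are also taken into account: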
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

random_forest_bow = Pipeline([
    ('rf_tfidf', Feat_Selection.countV),
    ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
])
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])
I also want to include other features in the model. I define X and y as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)
My dataset looks like this:
Text                              is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                0               0             0              0
ANOTHER example of text           1               1             0              1
What's happening?Let's talk at 5  1               0             1              1
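I also want to use is_it_capital?, is_it_upper? and contains_num? as features, but since they hold binary values (1 or 0 after encoding), I should apply BoW only to Text to extract the additional features. Maybe the answer is obvious, but I am new to ML and not familiar with classifiers and encoding, so I would appreciate any support and comments. Thanks.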
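You can certainly use your "extra" features such as is_it_capital?, is_it_upper? and contains_num?. It seems you are struggling with how exactly to combine two seemingly disparate feature sets. You can use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There is no reason you cannot combine the extra features with whatever a text feature-extraction method (such as your BoW approach) produces.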
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

# Vectorize only the 'text' column; pass the remaining columns through unchanged
transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)
print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
[0 0 0 1 1 0 1 1 1 1 0]
[1 0 2 0 0 0 1 1 0 0 1]
[0 1 0 0 0 1 0 0 0 0 0]]
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer releases
# use get_feature_names_out(), whose names for the passthrough columns may differ
print(transformer.get_feature_names())
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
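The answer also mentions sklearn.pipeline.FeatureUnion without showing it. The following is a minimal sketch of that alternative on the same toy DataFrame, assuming scikit-learn's FunctionTransformer is used to pick out columns. The helper names select_text and select_flags and the step names 'bow' and 'flags' are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

# Hypothetical helpers: each one picks the columns that a branch of the union works on
def select_text(frame):
    return frame['text']

def select_flags(frame):
    return frame[['is_it_upper', 'contains_num']].to_numpy()

union = FeatureUnion([
    # Branch 1: bag-of-words counts computed from the text column
    ('bow', Pipeline([
        ('select', FunctionTransformer(select_text)),
        ('count', CountVectorizer()),
    ])),
    # Branch 2: the binary flag columns, passed through unchanged
    ('flags', FunctionTransformer(select_flags)),
])

# The result is a sparse matrix: word-count columns first, then the two flag columns
X_union = union.fit_transform(df)
print(X_union.shape)
```

ColumnTransformer is usually the shorter option when the input is a DataFrame, because column selection is built in. FeatureUnion needs an explicit selection step per branch, as above.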
More specifically, for your example:
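import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need to use DenseTransformer to properly concatenate results
# from CountVectorizer and other transformer steps
class DenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray rather than the np.matrix from todense()
        return X.toarray()

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])
transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# The transformer returns NumPy arrays, so wrap them in DataFrames
# (keeping the original row index) before concatenating with the labels
df_train = pd.concat([pd.DataFrame(X_train, index=y_train.index), y_train], axis=1)
df_test = pd.concat([pd.DataFrame(X_test, index=y_test.index), y_test], axis=1)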
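To connect this back to the classifier in the question, here is a rough sketch of how the ColumnTransformer could sit in front of RandomForestClassifier inside a single Pipeline, so that both the bag-of-words counts and the binary columns reach the model. The toy rows below are invented purely for illustration and only mirror the column layout from the question. The step name 'features' and the random_state value are also assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data; only the column names follow the question
df = pd.DataFrame({
    'Text': ['an example of text', 'ANOTHER example of text',
             "What's happening?Let's talk at 5", 'plain lowercase words',
             'SHOUTING AGAIN', 'Call me at 7', 'nothing special here', 'Meet at 9 PM'],
    'is_it_capital?': [0, 1, 1, 0, 1, 1, 0, 1],
    'is_it_upper?':   [0, 1, 0, 0, 1, 0, 0, 0],
    'contains_num?':  [0, 0, 1, 0, 0, 1, 0, 1],
    'Label':          [0, 1, 1, 0, 1, 1, 0, 1],
})

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

# Bag-of-words on 'Text'; the three binary columns are appended unchanged
features = ColumnTransformer([('text', CountVectorizer(), 'Text')], remainder='passthrough')

model = Pipeline([
    ('features', features),
    ('rf_clf', RandomForestClassifier(n_estimators=300, random_state=0)),
])

# Fitting the pipeline fits the vectorizer on the training rows only
model.fit(X_train, y_train)
predicted = model.predict(X_test)
accuracy = (predicted == y_test.to_numpy()).mean()
print(accuracy)
```

Keeping the feature extraction inside the pipeline means the vocabulary is learned from the training split alone and the same transformation is reused at prediction time. With this approach no explicit densifying step is required, since the random forest accepts sparse input.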