How do I use Tf-idf features to train a model?

mri*_*ank 1 machine-learning scikit-learn text-classification naivebayes tfidfvectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf= True, 
                       min_df = 5, 
                       norm= 'l2', 
                       ngram_range= (1,2), 
                       stop_words ='english')

feature1 = tfidf.fit_transform(df.Rejoined_Stem)
array_of_feature = feature1.toarray()

I used the code above to extract features from my text documents.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes on Lemmatized Text
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state = 0)
X_train_counts = tfidf.fit_transform(X_train)
clf = MultinomialNB().fit(X_train_counts, y_train)
y_pred = clf.predict(tfidf.transform(X_test))

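Then I used this code to train my model. Can someone explain how the features above are used when training the model, given that the feature1 variable is not used anywhere during training?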

Pra*_*iel 10

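No, you are not using feature1, because you ran a second, separate transformation that produced X_train_counts. The model is trained only on X_train_counts.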
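To make this concrete, here is a minimal sketch on a toy corpus. The documents and the names `all_docs` and `train_docs` are made up for illustration and stand in for `df.Rejoined_Stem` and `X_train`. It shows that a second `fit_transform` call refits the vectorizer from scratch, so the matrix from the first call plays no part in what follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for df.Rejoined_Stem and X_train.
all_docs = ["late fee on credit card", "mortgage payment late", "credit report error"]
train_docs = ["late fee on credit card", "credit report error"]

tfidf = TfidfVectorizer()

# First fit: the vocabulary is learned from all_docs.
feature1 = tfidf.fit_transform(all_docs)
vocab_first = set(tfidf.vocabulary_)

# Second fit on the same object: the vocabulary is rebuilt from train_docs only.
X_train_counts = tfidf.fit_transform(train_docs)
vocab_second = set(tfidf.vocabulary_)

print(feature1.shape, X_train_counts.shape)
print(sorted(vocab_first - vocab_second))  # terms that exist only in the first fit
```

The two matrices have different column counts, so `feature1` could not be fed to a model trained on `X_train_counts` even if you wanted to.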


Let's walk through the code in logical order, keeping only the variables that are actually used for feature extraction and model training.

# imports used
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# split the data with random_state 0; test_size defaults to 0.25 because you did not set it
# (the text column is passed as a Series: TfidfVectorizer expects an iterable of strings)
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)

# you created your transformer, to `fit_transform` X_train and `transform` X_test
tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        norm='l2',
                        ngram_range=(1, 2),
                        stop_words='english')

X_train_counts = tfidf.fit_transform(X_train)
X_test_counts = tfidf.transform(X_test)

# you created your model and fit it on X_train_counts and y_train
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

# you predicted from your transformed test features
y_pred = clf.predict(X_test_counts)

There is a better way to use the Scikit-learn API that removes the clutter and helps you avoid this kind of confusion: Pipelines.

# imports used: see Pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# split the data with random_state 0; test_size defaults to 0.25 because you did not set it
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)

# collect the transformer params
tfidf_params = dict(sublinear_tf=True,
                    min_df=5,
                    norm='l2',
                    ngram_range=(1, 2),
                    stop_words='english')

# create a Pipeline that transforms the features, then passes them to the model
clf = Pipeline(steps=[
    ('features', TfidfVectorizer(**tfidf_params)),
    ('model', MultinomialNB())
])

# use clf as a model: fit it on the raw X_train and y_train
clf.fit(X_train, y_train)

# predict from the raw test text
y_pred = clf.predict(X_test)

On .fit, the pipeline calls fit_transform on the data and passes the result to the model. On .predict, it calls transform on the data before passing it to the model.
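As a quick check of that description, the sketch below runs the manual route and the Pipeline route side by side and compares their predictions. The toy documents and labels are invented for illustration and are not from the question's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only to compare the two routes.
train_docs = ["late fee on credit card", "credit card charged twice",
              "mortgage payment late", "mortgage escrow error"]
train_labels = ["card", "card", "mortgage", "mortgage"]
test_docs = ["credit card fee", "mortgage payment"]

# Manual route: fit_transform on the training text, transform on the test text.
tfidf = TfidfVectorizer()
nb = MultinomialNB().fit(tfidf.fit_transform(train_docs), train_labels)
manual_pred = nb.predict(tfidf.transform(test_docs))

# Pipeline route: the same two steps, handled internally by fit and predict.
pipe = Pipeline(steps=[("features", TfidfVectorizer()), ("model", MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

print(np.array_equal(manual_pred, pipe_pred))
```

Both routes use the same default settings here, so they should agree. The Pipeline just removes the chance of fitting the vectorizer on one dataset and the model on another.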


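The biggest advantage of this approach is that you can easily swap models or transformers. Here is an example of a baseline model comparison: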

# collection to store results
from collections import defaultdict
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# models to test
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegressionCV

# instantiate our storage
bench_mark = defaultdict(list)

# split the data with random_state 0; test_size defaults to 0.25 because you did not set it
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)

# collect the transformer params
tfidf_params = dict(sublinear_tf=True,
                    min_df=5,
                    norm='l2',
                    ngram_range=(1, 2),
                    stop_words='english')

# list of models we would like to compare
models = [
    # max_iter must be an integer; tol is a small stopping tolerance
    PassiveAggressiveClassifier(C=1e-1, max_iter=1000, tol=1e-3),
    # note: 'roc_auc' scoring is intended for binary targets
    RidgeClassifierCV(scoring='roc_auc', cv=10),
    LogisticRegressionCV(cv=5, solver='saga', scoring='accuracy', random_state=1, n_jobs=-1),
    # the logistic loss is named 'log_loss' in recent scikit-learn ('log' before version 1.1)
    SGDClassifier(loss='log_loss', random_state=1, max_iter=101),
]

# train, test and store each model
for model in models:

    # our pipeline is changed to accept the model
    clf = Pipeline(steps=[
        ('features', TfidfVectorizer(**tfidf_params)),
        ('model', model)  # just model, not model(), as it is already instantiated in the models list
    ])

    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    model_name = clf.named_steps['model'].__class__.__name__  # hack to get the name

    model_params = clf.named_steps['model'].get_params()

    print(f'{model_name} Scored: {score:.3f}\n')

    bench_mark['model_name'].append(model_name)
    bench_mark['score'].append(score)
    bench_mark['model'].append(clf)
    bench_mark['used_params'].append(model_params)

# in the end, place the bench_mark in a DataFrame
models_df = pd.DataFrame(bench_mark)

# now you have the trained models in a DataFrame, with their scores and parameters.
# You can access and use any model.
logistic_reg = models_df[models_df['model_name'] == 'LogisticRegressionCV']['model'].iloc[0]

y_preds = logistic_reg.predict(X_test)

Hope this helps.
