Pra*_*thy 5 nlp machine-learning scikit-learn text-classification naivebayes
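I am trying to solve a text classification problem. I have a limited number of labels that capture the categories of my text data. If incoming text does not fit any label, it should be tagged as "Other". In the example below, I built a text classifier that classifies text as "Breakfast" or "Italian". In the test scenario, I included a couple of texts that do not fit the labels I used for training. This is the challenge I am facing. Ideally, I want the model to say "Other" for "I like hiking" and "everyone should understand maths". How can I do this?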
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

# Training texts and their labels
X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]

# The last two test texts fit neither label
X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

# Bag of words -> tf-idf weights -> multinomial naive Bayes
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)

predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)  # columns: breakfast, italian
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I think of the "Other" category as noise, and I cannot model it as a category of its own.
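I think Kalsi may have been suggesting this, but it was not clear to me. You can define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any of the classes ("italian" and "breakfast" in your example), you decline to classify the sample, which gives you the "other" "class".
I say "class" in quotes because "other" is not really a class. You probably do not want your classifier to be good at predicting "other", so a confidence threshold may be a good approach.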
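A minimal sketch of the threshold idea, reusing the pipeline from the question. The helper name `predict_with_threshold`, the `fallback` label, and the cutoff of 0.7 are assumptions made for illustration and are not part of scikit-learn; in practice the cutoff would have to be tuned on held-out data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X_train = ["coffee is my favorite drink",
           "i like to have tea in the morning",
           "i like to eat italian food for dinner",
           "i had pasta at this restaurant and it was amazing",
           "pizza at this restaurant is the best in nyc",
           "people like italian food these days",
           "i like to have bagels for breakfast",
           "olive oil is commonly used in italian cooking",
           "sometimes simple bread and butter works for breakfast",
           "i liked spaghetti pasta at this italian restaurant"]
y_train = ["breakfast", "breakfast", "italian", "italian", "italian",
           "italian", "breakfast", "italian", "breakfast", "italian"]

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
classifier.fit(X_train, y_train)


def predict_with_threshold(model, texts, threshold=0.7, fallback="other"):
    """Hypothetical helper: return `fallback` when no class reaches `threshold`."""
    proba = model.predict_proba(texts)
    # Start from the most probable class for every text ...
    labels = model.classes_[proba.argmax(axis=1)].astype(object)
    # ... then overwrite the predictions that are not confident enough.
    labels[proba.max(axis=1) < threshold] = fallback
    return labels


X_test = ["this is an amazing italian place. i can go there every day",
          "i like hiking",
          "everyone should understand maths"]
print(predict_with_threshold(classifier, X_test, threshold=0.7))
```

Going by the probabilities printed in the question, a cutoff of 0.7 would keep only the first test sentence (about 0.75 for "italian") and would also send the two genuine breakfast sentences (about 0.53 each) to "other". With this little training data the probabilities stay close to the class priors, so the threshold trades missed in-domain texts against rejected off-topic ones.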
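You can try setting the class priors when you create the MultinomialNB. Create a dummy "Other" training example, then set the prior for "Other" high enough that instances default to "Other" when there is not enough evidence to pick one of the other classes.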
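One way this could look, as a sketch only. Using an empty string as the dummy "other" document and the prior values `[0.25, 0.25, 0.5]` are assumptions chosen for illustration; the answer does not specify either. `class_prior` is an existing `MultinomialNB` parameter, and its entries follow the sorted label order (`breakfast`, `italian`, `other`).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X_train = ["coffee is my favorite drink",
           "i like to have tea in the morning",
           "i like to eat italian food for dinner",
           "i had pasta at this restaurant and it was amazing",
           "pizza at this restaurant is the best in nyc",
           "people like italian food these days",
           "i like to have bagels for breakfast",
           "olive oil is commonly used in italian cooking",
           "sometimes simple bread and butter works for breakfast",
           "i liked spaghetti pasta at this italian restaurant"]
y_train = ["breakfast", "breakfast", "italian", "italian", "italian",
           "italian", "breakfast", "italian", "breakfast", "italian"]

# Assumption: an empty document stands in for "other". It contributes no
# word counts, so after smoothing every word is equally likely under "other".
X_aug = X_train + [""]
y_aug = y_train + ["other"]

# Illustrative priors in sorted label order: breakfast, italian, other.
classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(class_prior=[0.25, 0.25, 0.5])),
])
classifier.fit(X_aug, y_aug)

print(classifier.classes_)
print(classifier.predict(["everyone should understand maths"]))
```

A sentence made up entirely of words outside the training vocabulary, such as "everyone should understand maths", carries no evidence at all, so the prediction falls back to the class with the largest prior, here "other". How high the prior can go before in-domain texts are also pulled into "other" depends on the data and needs to be checked on a validation set.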