标签: naivebayes

SciKit使用RFECV学习特征选择和交叉验证

我仍然是机器学习的新手,并试图自己解决问题.我正在使用SciKit学习并拥有大约20,000个功能的推文数据集(n_features = 20,000).到目前为止,我的精确度,召回率和f1得分都达到了79%左右.我想使用RFECV进行特征选择并提高模型的性能.我已经阅读了SciKit学习文档,但对如何使用RFECV仍然有点困惑.

这是我到目前为止的代码:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics

# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)

# Create the RFECV object
nb …

Run Code Online (Sandbox Code Playgroud)

machine-learning feature-selection scikit-learn cross-validation naivebayes

cir*_*lle

lucky-day

1
推荐指数

1
解决办法

2947
查看次数

如何产生混淆矩阵并找出NaïveBayes分类器的错误分类率？

使用R中的虹膜数据集,我试图将NaïveBayes分类器拟合到虹膜训练数据中,这样我就可以为朴素贝叶斯分类器生成训练数据集(预测与实际)的混淆矩阵,错误分类率是多少NaïveBayes分类器？

到目前为止,这是我的代码:

 iris$spl=sample.split(iris,SplitRatio=0.8)
 train=subset(iris, iris$spl==TRUE)
 test=subset(iris, iris$spl==FALSE)

 iris.nb <- naiveBayes(Species~.,data = train)
 iris.nb

 nb_test_predict <- predict(iris.nb, train)

Run Code Online (Sandbox Code Playgroud)

有关如何解决这个问题的任何建议？

r naivebayes

CA *_*rns

lucky-day

1
推荐指数

1
解决办法

8339
查看次数

如何使用 Tf-idf 特征来训练模型？

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf= True, 
                       min_df = 5, 
                       norm= 'l2', 
                       ngram_range= (1,2), 
                       stop_words ='english')

feature1 = tfidf.fit_transform(df.Rejoined_Stem)
array_of_feature = feature1.toarray()

Run Code Online (Sandbox Code Playgroud)

我使用上面的代码来获取我的文本文档的功能。

from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes on Lemmatized Text
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state = 0)
X_train_counts = tfidf.fit_transform(X_train)
clf = MultinomialNB().fit(X_train_counts, y_train)
y_pred = clf.predict(tfidf.transform(X_test))

Run Code Online (Sandbox Code Playgroud)

然后我使用这段代码来训练我的模型。有人可以解释一下在训练模型时如何使用上述特征，因为在训练时 feature1 变量没有在任何地方使用？

machine-learning scikit-learn text-classification naivebayes tfidfvectorizer

mri*_*ank

lucky-day

1
推荐指数

1
解决办法

4880
查看次数

如何使用sklearn库对朴素贝叶斯进行文本分类？

我正在尝试使用朴素贝叶斯文本分类器进行文本分类.我的数据采用以下格式,根据问题和摘录,我必须决定问题的主题.培训数据有超过20K的记录.我知道SVM会是一个更好的选择,但我想使用sklearn库与Naive Bayes一起使用.

{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png).  \n\nWhat is the effective capacitance of this circuit and will the ...\r\n        "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745).  The new outlet has 3 wires coming out of it--a black, a white, …

Run Code Online (Sandbox Code Playgroud)

python machine-learning scikit-learn text-classification naivebayes

Shi*_*wal

lucky-day

0
推荐指数

1
解决办法

2215
查看次数