I am trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones available through caret. However, I cannot seem to get caret to work right.
Here is some code for reproduction. First the quanteda side:
library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)
# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)
# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_select(pattern = training_dfm,
               selection = "keep")
# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#> predicted_class
#> actual_class neg pos
#> neg 202 47
#> pos 49 202
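For reference, the accuracy is simply the diagonal share of that confusion matrix (correct predictions over all predictions):
sum(diag(class_table_quanteda)) / sum(class_table_quanteda)
#> [1] 0.808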
Not bad. 80.8% accuracy without any tuning. Now the same thing (as far as I can tell) with caret:
training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")
nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)
predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#> predicted_class_caret
#> actual_class neg pos
#> neg 246 3
#> pos 249 2
Not only is the accuracy here shockingly low (49.6%, barely above chance), but the pos class is almost never predicted! So I am pretty sure I am missing something crucial here, as I would assume the two implementations to be rather similar; I am just not sure what.
I have already looked at the source code of the quanteda function (hoping it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 in caret later, roughly as in the sketch below), accuracy improves somewhat, but only to about 53%.
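This is roughly what I tried; note that dfm_weight() and dfm_smooth() here are my guess at what the textmodel_nb() internals are doing, not a confirmed equivalent:
# mimic the weighting/smoothing seen in the textmodel_nb() source (a guess),
# then train the caret model with laplace = 0, since the dfm is already smoothed
training_m_ws <- training_dfm %>%
    dfm_smooth(smoothing = 1) %>%     # add-one (Laplace) smoothing of counts
    dfm_weight(scheme = "prop") %>%   # relative term frequencies per document
    convert(to = "matrix")
nb_caret_ws <- train(x = training_m_ws,
                     y = as.factor(docvars(training_dfm, "Sentiment")),
                     method = "naive_bayes",
                     trControl = trainControl(method = "none"),
                     tuneGrid = data.frame(laplace = 0,
                                           usekernel = FALSE,
                                           adjust = FALSE))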
The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a multinomial distribution (with an option for a Bernoulli distribution as well), which is more appropriate for text.
The documentation for textmodel_nb() also states that it replicates the example from the IIR book (Manning, Raghavan and Schütze 2008), as well as a further example from Jurafsky and Martin (2018). See:
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
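As an illustration of the multinomial calculation, the small worked example from IIR Chapter 13 can be reproduced in a few lines. This sketch follows the example in the textmodel_nb() documentation from memory, so treat the exact call as an assumption:
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
# fit the multinomial model on the four labelled documents
tmod <- textmodel_nb(trainingset[1:4, ], trainingclass[1:4], prior = "docfreq")
# d5 should be assigned to class "Y", matching IIR Example 13.1
predict(tmod, newdata = trainingset[5, ])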
Another package, e1071, produces the same results you found, as it is also based on a Gaussian distribution.
library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
## nb_e1071_pred
## actual_class neg pos
## neg 246 3
## pos 249 2
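You can see the Gaussian assumption directly in the fitted e1071 object: for numeric predictors, naiveBayes() stores a per-class mean and standard deviation for each feature, i.e. the parameters of a normal density (picking the first feature arbitrarily here):
# per-class mean and sd of the first feature: the fitted Gaussian parameters
nb_e1071$tables[[1]]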
However, both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!
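The memory footprint alone hints at the difference; a rough check with utils::object.size() (exact sizes will depend on the corpus):
# sparse dfm vs. its dense matrix conversion
format(object.size(training_dfm), units = "MB")
format(object.size(training_m), units = "MB")
And the speed difference is just as stark: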
library("rbenchmark")
benchmark(
    quanteda = {
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                               y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 caret 1 29.042 123.583 25.896 3.095 0 0
## 3 e1071 1 217.177 924.157 215.587 1.169 0 0
## 1 quanteda 1 0.235 1.000 0.213 0.023 0 0