SPi*_*SPi 6 r classification machine-learning text-mining
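I am trying to classify tweets into three categories (Buy, Hold, Sell) according to their sentiment. I am using R and the e1071 package.
I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.
Training set data frame: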
+--------------------------------------------------+
**text | sentiment**
*this stock is a good buy* | Buy
*markets crash in tokyo* | Sell
*everybody excited about new products* | Hold
+--------------------------------------------------+
Now I want to train the model using the tweet text, trainingset[,2], and the sentiment category, trainingset[,4].
classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)
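Inspecting the elements of the classifier with
classifier$tables$x
I can see that the conditional probabilities have been computed. There are different probabilities for each tweet with respect to Buy, Hold and Sell. So far so good.
But when I predict on the training set:
predict(classifier, trainingset[,2], type="raw")
I get a classification that is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold and Sell: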
+--------------------------------------------------+
**Id | Buy | Hold | Sell**
1 |0.25 | 0.5 | 0.25
2 |0.25 | 0.5 | 0.25
3 |0.25 | 0.5 | 0.25
.. |..... | .... | ...
N |0.25 | 0.5 | 0.25
+--------------------------------------------------+
Any idea what I am doing wrong? Thanks for your help!
Thanks
It looks like you trained the model with whole sentences as input, whereas you seem to want to use words as input features.
Usage:
## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)
## S3 method for class 'naiveBayes'
predict(object, newdata, type = c("class", "raw"), threshold = 0.001, ...)
Arguments:
x: A numeric matrix, or a data frame of categorical and/or numeric variables.
y: Class vector.
In particular, if you train naiveBayes like this:
x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad"))
bayes<-naiveBayes( x,y )
you get a classifier that can only recognise these two exact sentences:
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = x,y = y)
A-priori probabilities:
y
bad good
0.5 0.5
Conditional probabilities:
x
x
y john likes cake marry likes cats and john
bad 0 1
good 1 0
To get a word-level classifier, you need to run it with words as input:
x <- c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor( c("good","good", "good","bad", "bad", "bad", "bad","bad") )
bayes<-naiveBayes( x,y )
You get
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = x,y = y)
A-priori probabilities:
y
bad good
0.625 0.375
Conditional probabilities:
x
y and cake cats john likes marry
bad 0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
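The table printed for the word-level classifier can be reproduced by hand, which shows what the model has learned: with laplace = 0, each entry is the share of a word among all words labelled with that class. A minimal sketch in base R, using the same words and labels as the example (the names counts, priors and cond are chosen here for illustration only):

```r
# Same words and labels as in the word-level example
x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

# How often each word occurs under each class
counts <- table(y, x)

# A-priori probabilities: share of each class among all observations
priors <- prop.table(table(y))

# Conditional probabilities: divide every row by its row total
cond <- prop.table(counts, margin = 1)

print(priors)
print(round(cond, 7))
```

Each row of cond sums to 1, and a word that never occurs with a class gets probability 0, which is the situation the laplace argument is meant to smooth.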
In general, R is not well suited to processing NLP data; Python (or at least Java) would be a better choice.
To convert a sentence into words, you can use the strsplit function:
unlist(strsplit("john likes cake"," "))
[1] "john" "likes" "cake"
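Putting the pieces together for the tweet problem: one common way to use words as features is to give every tweet one row and every vocabulary word one yes/no column, then pass that data frame to naiveBayes. The following is only a sketch under assumptions: the three tweets are toy data modelled on the table in the question, the helper names tokenize and word_features are hypothetical, and splitting on single spaces ignores punctuation and other real-world cleaning.

```r
library(e1071)  # provides naiveBayes(); assumed to be installed

# Hypothetical toy data modelled on the table in the question
tweets <- c("this stock is a good buy",
            "markets crash in tokyo",
            "everybody excited about new products")
sentiment <- factor(c("Buy", "Sell", "Hold"))

# Split a character vector of tweets into lower-case words
tokenize <- function(texts) strsplit(tolower(texts), " ", fixed = TRUE)

# One row per tweet, one yes/no factor column per vocabulary word
word_features <- function(texts, vocab) {
  present <- vapply(tokenize(texts),
                    function(words) vocab %in% words,
                    logical(length(vocab)))
  # vapply() returns words x tweets; transpose to tweets x words
  features <- as.data.frame(t(present))
  names(features) <- vocab
  # Fixed levels so training and prediction data are encoded identically
  features[] <- lapply(features, factor,
                       levels = c(FALSE, TRUE), labels = c("no", "yes"))
  features
}

# Vocabulary comes from the training tweets only
vocab <- sort(unique(unlist(tokenize(tweets))))
train <- word_features(tweets, vocab)

# laplace = 1 avoids zero probabilities for unseen word/class pairs
classifier <- naiveBayes(train, sentiment, laplace = 1)

# Predict on the training set and on a new, made-up tweet
predicted <- predict(classifier, train, type = "class")
raw <- predict(classifier, word_features("good stock", vocab), type = "raw")
print(predicted)
print(raw)
```

Because the new data is built with the same vocab and the same factor levels as the training data, its column names match the classifier's tables; words outside the vocabulary are simply ignored by this sketch.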