R: Naive Bayes classifier decides based only on a-priori probabilities

SPi*_*SPi 6 r classification machine-learning text-mining

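I am trying to classify tweets into three categories (Buy, Hold, Sell) according to their sentiment. I am using R and the e1071 package.

I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.

Training set data frame: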

   text                                 | sentiment
   -------------------------------------+----------
   this stock is a good buy             | Buy
   markets crash in tokyo               | Sell
   everybody excited about new products | Hold

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)

When I inspect the elements of the classifier with

classifier$tables$x

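I see that the conditional probabilities have been computed. Each tweet has different probabilities for Buy, Hold and Sell. So far so good.

But when I predict on the training set: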

predict(classifier, trainingset[,2], type="raw")

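the classification I get is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold and Sell: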

      Id | Buy  | Hold | Sell
      ---+------+------+-----
      1  | 0.25 | 0.5  | 0.25
      2  | 0.25 | 0.5  | 0.25
      3  | 0.25 | 0.5  | 0.25
      .. | ...  | ...  | ...
      N  | 0.25 | 0.5  | 0.25

Any ideas what I am doing wrong? Thanks for your help!

lej*_*lot 8

It looks like you trained the model using whole sentences as input, whereas you seem to want to use words as input features.

Usage:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

  x: A numeric matrix, or a data frame of categorical and/or
     numeric variables.

  y: Class vector.

In particular, if you train naiveBayes like this:

library(e1071)  # provides naiveBayes()

x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad"))
bayes <- naiveBayes(x, y)

you get a classifier that can only recognize these two exact sentences:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.5  0.5 

Conditional probabilities:
            x
      x
y      john likes cake marry likes cats and john
  bad                0                         1
  good               1                         0

To get a word-level classifier, you need to run it with words as input:

x <-            c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor( c("good","good", "good","bad",  "bad",  "bad", "bad","bad") )
bayes <- naiveBayes(x, y)

You get

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.625 0.375 

Conditional probabilities:
      x
y            and      cake      cats      john     likes     marry
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
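
The numbers in this output can be checked by hand with base R alone. The sketch below rebuilds the same word and label vectors as in the example above and derives the class shares and per-class word frequencies with table and prop.table. It assumes no Laplace smoothing (laplace = 0, the default in the usage shown earlier); the variable names priors and cond are only illustrative.

```r
# Same word-level data as in the example above
x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

# A-priori probabilities: share of each class among the training rows
priors <- prop.table(table(y))

# Conditional probabilities P(word | class): counts normalised within each class
cond <- prop.table(table(y, x), margin = 1)

print(priors)
print(round(cond, 7))
```

Each row of cond sums to 1, so a word that never occurs with a class gets probability 0 for it unless smoothing is applied.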

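In general, R is not well suited to processing NLP data; Python (or at least Java) would be a better choice.

To convert a sentence into words, you can use the strsplit function: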

unlist(strsplit("john likes cake"," "))
[1] "john"  "likes" "cake" 
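
For a whole data frame of tweets, one possible way to go from sentences to word-level features is a presence/absence table with one column per vocabulary word, so that each tweet stays one row with one label. The sketch below uses base R for the feature table; the tweets and labels vectors are made-up stand-ins for trainingset[,2] and trainingset[,4], and the helper name word_features is hypothetical. Splitting on spaces is a simplification, since real tweets would also need punctuation handling. The final naiveBayes call only runs if e1071 is installed.

```r
tweets <- c("this stock is a good buy",
            "markets crash in tokyo",
            "everybody excited about new products")
labels <- factor(c("Buy", "Sell", "Hold"))

# Split each tweet into lower-case words (one character vector per tweet)
tokens <- strsplit(tolower(tweets), " +")

# Vocabulary: every distinct word seen in the training tweets
vocab <- sort(unique(unlist(tokens)))

# Hypothetical helper: one "no"/"yes" factor column per vocabulary word
word_features <- function(tokens, vocab) {
  present <- lapply(tokens, function(words) vocab %in% words)
  m <- matrix(unlist(present), nrow = length(tokens), byrow = TRUE,
              dimnames = list(NULL, vocab))
  df <- as.data.frame(m)
  df[] <- lapply(df, factor, levels = c(FALSE, TRUE), labels = c("no", "yes"))
  df
}

features <- word_features(tokens, vocab)

# Train and predict on the same column layout (requires the e1071 package)
if (requireNamespace("e1071", quietly = TRUE)) {
  model <- e1071::naiveBayes(features, labels, laplace = 1)
  print(predict(model, features, type = "raw"))
}
```

New tweets would have to be passed through word_features with the same vocab, so that the columns given to predict match the ones the model was trained on.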