Why does Spark ML NaiveBayes output labels that differ from the training data?

Question

Pim*_*kos 5 scala machine-learning apache-spark naivebayes apache-spark-ml

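I'm using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that differ from the labels in my training set. Am I doing something wrong?

Here is a small example that can be pasted into, for instance, a Zeppelin notebook: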

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prediction: Double) =>
    println(s"($id, $text) --> prediction=$prediction")
  }

Output of the small program:

(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0

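The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.

Question: how do I map these predicted labels back to my original training set labels?

Bonus question: why do the training set labels have to be doubles, when any other type would work just as well as a label? It seems unnecessary.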

Answer 1

zer*_*323 4

However, the classifier outputs labels that differ from the labels in my training set. Am I doing something wrong?

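Kind of. As far as I can tell, you've hit the issue described in SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but ml.NaiveBayes does not validate this. Under the hood the data is passed to mllib.NaiveBayes, which has no such restriction, so the training process goes through without complaint.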
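The effect can be reproduced without Spark. The following plain-Scala sketch shows why the output looks like 0.0, 1.0, 2.0: if a prediction is taken as the position of the largest per-class score, the original label values never appear in it. The object name `ArgmaxSketch` and the scores are made up for illustration, and the assumption that classes are stored in ascending label order is mine; it matches the output in the question but should be checked against your Spark version.

```scala
object ArgmaxSketch {
  // Training labels from the question, assumed to be kept in ascending order.
  val labels: Array[Double] = Array(100.0, 200.0, 300.0)

  // Hypothetical per-class scores for one document; scores(i) belongs to labels(i).
  val scores: Array[Double] = Array(-12.4, -9.1, -3.7)

  // Position of the largest score, which is all an argmax-based predict can return.
  def argmax(xs: Array[Double]): Int = xs.indices.maxBy(i => xs(i))

  // What gets reported as the prediction: an index, not a label.
  val predictedIndex: Int = argmax(scores)

  // Manual lookup that recovers the original label from the index.
  val originalLabel: Double = labels(predictedIndex)
}
```

Under that assumption, a prediction of 2.0 in the question's output corresponds to the training label 300.0.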

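When the model is converted back to ml, the prediction function simply assumes the labels were correct and returns the predicted label using Vector.argmax, hence the results you get. You can fix the labels using, for example, StringIndexer.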
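A minimal sketch of the StringIndexer approach, assuming the `training`, `test`, `tokenizer` and `hashingTF` values from the question are still in scope. The column names `indexedLabel` and `predictedLabel` are arbitrary choices, and the use of `IndexToString` to translate predictions back is my addition rather than part of the original answer; check that it is available in your Spark version.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Assumes `training`, `test`, `tokenizer` and `hashingTF` from the question.

// Learn a mapping from the original labels to the indices 0.0, 1.0, 2.0.
// Numeric labels are cast to strings first, so 100.0 becomes "100.0".
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)

// Index the training labels up front, so the unlabeled test set never has to
// pass through the indexer.
val indexedTraining = labelIndexer.transform(training)

// Train on the 0-based labels instead of the raw ones.
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")

// Translate predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb, labelConverter))

val model = pipeline.fit(indexedTraining)

// predictedLabel holds strings such as "300.0" rather than doubles.
model.transform(test)
  .select("id", "text", "predictedLabel")
  .show()
```

Because the indexer works on strings, the same pipeline also accepts string labels such as "negative" or "positive" in the label column without further changes.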

Why do the training set labels have to be doubles, when any other type would work just as well as a label?

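I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. It is also an efficient representation in terms of memory usage and computational cost.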
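To illustrate the reuse argument, here is a small sketch using mllib's `LabeledPoint`. The feature values and variable names are invented for the example; the point is only that one type covers both kinds of problem.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Classification: the label is a class index stored as a Double.
val classExample = LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))

// Regression: the label is a real-valued target, stored in the same field.
val regressionExample = LabeledPoint(37.5, Vectors.dense(0.0, 2.0, 1.0))
```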

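I would even argue that forcing users to pick double-typed labels in the 0-n range is unintuitive to begin with. Labels in data are usually strings, such as names. This forces users to map those labels to doubles as a preprocessing step, which is tedious boilerplate code. (2 upvotes)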
