Alb*_*nto · 9 · Tags: scala, logistic-regression, apache-spark
I'm trying to predict a label for each row of a DataFrame, but without using LogisticRegressionModel's transform method. For reasons of my own, I'm instead trying to compute the prediction manually with the classic formula 1 / (1 + e^(-hθ(x))). Note that I copied the code from the Apache Spark repository and moved almost everything from the private BLAS object into a public version of it. P.S.: I'm not using any regParam; I simply fit the model.
//Notice that I had to obtain intercept, and coefficients from my model
val intercept = model.intercept
val coefficients = model.coefficients
val margin: Vector => Double = (features) => {
BLAS.dot(features, coefficients) + intercept
}
val score: Vector => Double = (features) => {
val m = margin(features)
1.0 / (1.0 + math.exp(-m))
}
After defining these functions and obtaining the model's parameters, I created a UDF to compute the prediction (it receives the same features, as a DenseVector). But when I later compare my predictions against the model's real ones, they are very different! So what am I missing? What am I doing wrong?
val predict = udf((v: DenseVector) => {
val recency = v(0)
val frequency = v(1)
val tp = score(new DenseVector(Array(recency, frequency)))
new DenseVector(Array(tp, 1 - tp))
})
// model's predictions
val xf = model.transform(df)
df.select(col("id"), predict(col("features")).as("myprediction"))
.join(xf, df("id") === xf("id"), "inner")
.select(df("id"), col("probability"), col("myprediction"))
.show
+----+--------------------+--------------------+
| id| probability| myprediction|
+----+--------------------+--------------------+
| 31|[0.97579780436514...|[0.98855386037790...|
| 231|[0.97579780436514...|[0.98855386037790...|
| 431|[0.69794428333266...| [1.0,0.0]|
| 631|[0.97579780436514...|[0.98855386037790...|
| 831|[0.97579780436514...|[0.98855386037790...|
|1031|[0.96509616791398...|[0.99917463322937...|
|1231|[0.96509616791398...|[0.99917463322937...|
|1431|[0.96509616791398...|[0.99917463322937...|
|1631|[0.94231815700848...|[0.99999999999999...|
|1831|[0.96509616791398...|[0.99917463322937...|
|2031|[0.96509616791398...|[0.99917463322937...|
|2231|[0.96509616791398...|[0.99917463322937...|
|2431|[0.95353743438055...| [1.0,0.0]|
|2631|[0.94646924057674...| [1.0,0.0]|
|2831|[0.96509616791398...|[0.99917463322937...|
|3031|[0.96509616791398...|[0.99917463322937...|
|3231|[0.95971207153567...|[0.99999999999996...|
|3431|[0.96509616791398...|[0.99917463322937...|
|3631|[0.96509616791398...|[0.99917463322937...|
|3831|[0.96509616791398...|[0.99917463322937...|
+----+--------------------+--------------------+
I even tried defining the functions inside the udf, like this, but it didn't work either:
def predict(coefficients: Vector, intercept: Double) = {
udf((v: DenseVector) => {
def margin(features: Vector, coefficients: Vector, intercept: Double): Double = {
BLAS.dot(features, coefficients) + intercept
}
def score(features: Vector, coefficients: Vector, intercept: Double): Double = {
val m = margin(features, coefficients, intercept)
1.0 / (1.0 + math.exp(-m))
}
val recency = v(0)
val frequency = v(1)
val tp = score(new DenseVector(Array(recency, frequency)), coefficients, intercept)
new DenseVector(Array(tp, 1 - tp))
})
}
This is quite embarrassing, but the problem turned out to be that I was using a Pipeline with a MinMaxScaler as one of its stages, so the dataset was scaled before the model was trained. Both coefficients and intercept were therefore tied to the scaled data, and when I used them to compute predictions on the raw features the results were completely biased. To solve it, I simply trained on the unscaled dataset, so that the coefficients and intercept I obtained matched the feature space of the data I was scoring. After re-running the code, I got the same results as Spark's transform. On a side note, I also followed @zero323's advice and moved the margin and score definitions inside the udf's lambda.
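An alternative to retraining on unscaled data would be to reproduce the MinMaxScaler step before applying the coefficients, so the manual score operates in the same feature space the model was trained in. Below is a minimal sketch in plain Scala, assuming the scaler's default [0, 1] output range; the originalMin/originalMax values are hypothetical placeholders (in a real pipeline they would come from the fitted scaler model's originalMin and originalMax vectors):

```scala
// Hypothetical per-feature min/max learned during fitting; read these from the
// fitted MinMaxScalerModel in a real pipeline.
val originalMin = Array(1.0, 0.0)
val originalMax = Array(365.0, 50.0)

// Rescale a raw feature vector the way MinMaxScaler (default range [0, 1]) does.
def scale(raw: Array[Double]): Array[Double] =
  raw.indices.map { i =>
    (raw(i) - originalMin(i)) / (originalMax(i) - originalMin(i))
  }.toArray

// Margin and sigmoid computed on the *scaled* features, matching the space the
// coefficients and intercept were learned in.
def score(raw: Array[Double], coefficients: Array[Double], intercept: Double): Double = {
  val scaled = scale(raw)
  val margin = scaled.zip(coefficients).map { case (x, w) => x * w }.sum + intercept
  1.0 / (1.0 + math.exp(-margin))
}
```

With this, score(Array(recency, frequency), coefficients.toArray, intercept) should line up with the first entry of the model's probability column, without retraining.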
Viewed: 340 times