Setting thresholds in multinomial logistic regression in PySpark

Van*_*era 5 machine-learning logistic-regression apache-spark pyspark apache-spark-ml

I want to run a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DF:

from pyspark.ml.linalg import DenseVector

test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)

My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). I then wrote:

from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)

Then I want to fit the model on my DF:

test_logit = test_logit_abst.fit(test_train_df)

But when executing this last command, I get an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
    264     def _fit(self, dataset):
--> 265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
267

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
--> 262         return self._java_obj.fit(dataset._jdf)
263
    264     def _fit(self, dataset):

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
1134
   1135         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'

The error says that threshold is set. This looks strange, since the documentation says that setting thresholds (plural) clears threshold (singular), so the 0.5 value should have been removed. How, then, can I clear threshold, given that no clearThreshold() exists?
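For what it's worth, checking with the generic pyspark.ml.param.Params methods (isSet, getOrDefault, clear) suggests that threshold is not explicitly set on the Python side at all; it merely carries its default value of 0.5, which is presumably what the JVM side complains about, so clear() has nothing to remove:

test_logit_abst.isSet(test_logit_abst.threshold)
# False - threshold is not explicitly set...
test_logit_abst.getOrDefault(test_logit_abst.threshold)
# 0.5 - ...but it does carry this default value
test_logit_abst.clear(test_logit_abst.threshold)  # a no-op here, since the param is unset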

In order to achieve this clearing, I tried setting threshold to None:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)

This time the fit command works, and I even obtain the model's intercepts and coefficients:

test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])

test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)

However, if I try to get the thresholds (plural) from test_logit_abst, I get an error:

test_logit_abst.getThresholds()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
    363         if not self.isSet(self.thresholds) and self.isSet(self.threshold):
    364             t = self.getOrDefault(self.threshold)
--> 365             return [1.0-t, t]
    366         else:
    367             return self.getOrDefault(self.thresholds)

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

What does this mean?


As a further detail, strangely (and incomprehensibly to me), reversing the order of the parameter settings produces the first error I posted above:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)

Why does changing the order of the "set" instructions change the output as well?

des*_*aut 9

A messy situation indeed...

The short answer is:

  1. setThresholds (plural) not clearing threshold (singular) seems to be a bug
  2. For multinomial classification (i.e. number of classes > 2), setThresholds does not work as expected (and arguably you don't need it)
  3. If all you need is some "thresholds" at their "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement
  4. If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column of the transformed dataframe (see the sketch right after this list) - although setThreshold(s) works OK for binary classification
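As a sketch of the manual post-processing mentioned in point 4 (an illustration only - the name custom_prediction and the threshold values are hypothetical), one can rank the classes by probability[i] / threshold[i], generalizing the p/t rule that Spark documents for the binary case:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_thresholds = [0.5, 0.5, 0.8]  # hypothetical per-class thresholds

# predict the class i maximizing probability[i] / my_thresholds[i]
custom_prediction = udf(
    lambda probs: float(max(range(len(probs)),
                            key=lambda i: probs[i] / my_thresholds[i])),
    DoubleType())

# usage on a transformed dataframe with a 'probability' column, e.g.:
# mlorModel.transform(mdf).withColumn(
#     'my_prediction', custom_prediction('probability')).show()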

And now for the long answer...

Let's start with binary classification, adapting the toy data from the documentation:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
     Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
     Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
     Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
     Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])

We don't need to set thresholds (plural) here - threshold=0.7 is enough - but it will prove useful when illustrating the differences with setThreshold below.

blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data

Here is the result:

+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction                             |probability                             |prediction| 
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0  |[-1.138455151184087,1.138455151184087]    |[0.242604109995602,0.757395890004398]   |1.0       |
|[1.0,2.0]|0.0  |[-0.6056346859838877,0.6056346859838877]  |[0.35305562698104337,0.6469443730189567]|0.0       | 
|[2.0,1.0]|1.0  |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0       | 
|[3.0,3.0]|0.0  |[1.6453673835702176,-1.6453673835702176]  |[0.8382639556951765,0.16173604430482344]|0.0       | 
+---------+-----+------------------------------------------+----------------------------------------+----------+

What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0, despite the fact that class 1.0 has a higher probability (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
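In fact, Spark's documentation describes the binary rule as predicting the class with the largest p/t, where p is the original probability of that class and t is the class's threshold; a quick manual check with the probabilities of the 2nd row (a sketch of my own) reproduces the 0.0 prediction:

probs = [0.35305562698104337, 0.6469443730189567]  # 'probability' of row 2
ts = [0.3, 0.7]                                    # our thresholds
[p / t for p, t in zip(probs, ts)]
# [1.1768..., 0.9242...] - class 0 wins, hence prediction 0.0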

Let's now try the seemingly identical operation, but with setThreshold(s) instead:

blor2 = (LogisticRegression()
  .setThreshold(0.7)
  .setThresholds([0.3, 0.7]) ) # works OK

blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'

Nice, isn't it?

setThresholds (plural) does seem to have cleared the value of threshold (0.7) we set in the previous line, as claimed in the docs, but it seems to have merely restored it to its default value of 0.5...

Omitting .setThreshold(0.7) gives the first error you report yourself (not shown here).

Reversing the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):

blor2 = (LogisticRegression()
  .setThresholds([0.3, 0.7])
  .setThreshold(0.7) )

blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]

Let's now move to the multinomial case; we'll again stick to the example in the documentation, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.

data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)

Similarly to the binary case above, the elements of thresholds (plural) should sum to 1; let's ask for a threshold of 0.8 for class 2:

mlor = (LogisticRegression()
       .setFamily("multinomial")
       .setThresholds([0, 0.2, 0.8])
       .setThreshold(0.8) )
mlorModel= mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]

Looks fine, but let's ask for the predictions on the (training) dataset:

mlorModel.transform(mdf).show(truncate=False)

I have singled out only one row - it should be the 2nd from the end of the full output:

+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+ 
|label|features                                            |rawPrediction                                            |probability                                                    |prediction| 
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0       | 
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+

Scrolling to the right, you will see that, despite the fact that the probability for class 2.0 here is below the threshold we set for it (0.8), the row is indeed predicted as 2.0 - in contrast to the binary case demonstrated above...
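This claim can be verified over the whole training set (again a sketch of my own, not part of the original demonstration): even with the thresholds above in place, the prediction column never deviates from the argmax of the probability vector:

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

# index of the largest element of the probability vector, as a double
argmax = F.udf(lambda v: float(max(range(len(v)), key=lambda i: v[i])),
               DoubleType())

(mlorModel.transform(mdf)
 .filter(argmax('probability') != F.col('prediction'))
 .count())
# 0 - no row where the prediction differs from the probability argmax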

So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, since the algorithm detects by itself that you have more than 2 classes. This gives identical results to the above:

mlor = LogisticRegression() # works OK - no family, no threshold(s)

To summarize:

  1. In both the binary and the multinomial case, what the algorithm actually returns is a vector of probabilities, with length equal to the number of classes and elements summing to 1.
  2. In the binary case only, Spark lets you go one step further: instead of naively picking the class with the highest probability as the prediction, you may apply a user-defined threshold; this setting can be useful, e.g. with imbalanced data.
  3. In the multinomial case, the threshold(s) setting has actually no effect, and Spark will always return as prediction the class with the highest probability.

Despite the confusion in the documentation (which I have argued about elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as has been nicely argued elsewhere (emphasis in the original):

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Although the above argument was made for the binary case, it fully holds for the multinomial one as well...