I want to perform a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DataFrame:
from pyspark.ml.linalg import DenseVector

# sqlc is an existing SQLContext
test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, default None) rather than threshold (singular, default 0.5). So I wrote:
from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)
Then I want to fit the model on my DataFrame:
test_logit = test_logit_abst.fit(test_train_df)
But when the last command is executed, I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says that threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so the 0.5 value should have been removed. So, how can I clear threshold, given that no clearThreshold() exists?
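(A quick way to see where that 0.5 comes from: threshold is never explicitly set on my estimator, but it does carry a default value, and PySpark apparently transfers defaults to the JVM side as well, which is presumably why Scala sees it as "set". A small check using the generic Params introspection methods:)

test_logit_abst.isSet(test_logit_abst.threshold)         # False - never explicitly set...
test_logit_abst.hasDefault(test_logit_abst.threshold)    # True
test_logit_abst.getOrDefault(test_logit_abst.threshold)  # 0.5 - ...but the default is there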
In order to achieve this, I tried to clear threshold this way:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
This time the fit command works, and I even obtain the model's intercept vector and coefficient matrix:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
However, if I try to get thresholds (plural) from test_logit_abst, I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, strangely (and incomprehensibly, to me), reversing the order of the parameter settings produces the first error I posted above:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)
Why does changing the order of the 'set' instructions also change the outcome?
A confusing situation indeed...
The short answer is:

- setThresholds (plural) not clearing threshold (singular) seems to be a bug;
- for multinomial classification, setThresholds does not achieve the expected effect (and arguably you don't need it);
- if you really need to apply different decision thresholds to different classes, you will have to do it manually, by post-processing the probability column of the transformed dataframe (a sketch is given at the end of this answer) - setThreshold(s) does work as expected for binary classification, though.

And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We wouldn't really need to set thresholds (plural) here - threshold=0.7 alone is enough, but it will be useful when illustrating the differences with setThreshold below.
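(A side note on why this particular pair of values is legal: when both Params are set, Spark requires them to be consistent, with threshold equal to thresholds[1] / (thresholds[0] + thresholds[1]) - which is also where the "equivalent to 0.7" wording in an error message further below comes from. Here the condition holds:)

t = [0.3, 0.7]
print(t[1] / (t[0] + t[1]))   # 0.7 - equal to threshold, so the two Params are consistent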
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0 despite the fact that the probability of class 1.0 is higher (0.65): 0.65 is indeed greater than 0.35, but it is lower than the threshold we have set for this class (0.7), hence the row is not predicted as such.
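(How Spark gets there: the docs for thresholds say the predicted class is the one maximizing p/t, where p is the original probability of the class and t is the class's threshold. Checking that on the 2nd row, with values copied from the output above:)

# probabilities of the 2nd row, copied from the output above
p = [0.35305562698104337, 0.6469443730189567]
t = [0.3, 0.7]                            # our thresholds
scores = [pi / ti for pi, ti in zip(p, t)]
print(scores)                             # [1.1768..., 0.9242...] - class 0 wins
print(float(scores.index(max(scores))))   # 0.0, matching the prediction column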
Let's now attempt the seemingly identical operation, but using setThreshold(s) instead:
blor2 = (LogisticRegression()
         .setThreshold(0.7)
         .setThresholds([0.3, 0.7]) )  # works OK

blorModel2 = blor2.fit(bdf)  # throws:
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) does indeed seem to have cleared the value of threshold (0.7) we set in the previous line, as claimed in the docs, but it seems merely to have restored it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you reported yourself (not shown).
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, makes both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
         .setThresholds([0.3, 0.7])
         .setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
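Notice the 0.30000000000000004 in that last output: per the docs, setThreshold (singular) clears thresholds (plural) if it has been set, so our [0.3, 0.7] is gone again after the second setter, and getThresholds falls back to computing [1.0 - t, t] from threshold - exactly the code path visible in the traceback of the question. The tell-tale floating-point arithmetic:

t = 0.7
print([1.0 - t, t])   # [0.30000000000000004, 0.7] - matches getThresholds() above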
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (it should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels {0.0, 1.0, 2.0}.
data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
Similarly to the binary case above, where the elements of our thresholds (plural) should sum up to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
        .setFamily("multinomial")
        .setThresholds([0, 0.2, 0.8])
        .setThreshold(0.8) )
mlorModel= mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for predictions on the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out just one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that the row is indeed predicted as 2.0, despite the fact that the probability of class 2.0 here is below the threshold we set for it (0.8) - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect on its own that you have more than 2 classes. This will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
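If, on the other hand, you really do need per-class decision thresholds in the multinomial case, here is the manual post-processing route mentioned in the short answer: re-derive the prediction from the probability column yourself. A minimal sketch - apply_thresholds is a hypothetical helper of mine, and it reuses the p/t rule that the docs describe for the binary case:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_thresholds = [0.1, 0.1, 0.8]   # example per-class thresholds (nonzero)

def apply_thresholds(probs):
    # pick the class maximizing probability / threshold
    scaled = [p / t for p, t in zip(probs.toArray(), my_thresholds)]
    return float(scaled.index(max(scaled)))

apply_thresholds_udf = udf(apply_thresholds, DoubleType())

# derive our own prediction from the probability column
preds = (mlorModel.transform(mdf)
         .withColumn('my_prediction', apply_thresholds_udf('probability')))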
To summarize:

1. In both the binary and the multinomial case, what the algorithm actually returns is a vector of probabilities (the probability column above), from which a prediction is derived.
2. In the binary case only, Spark lets you go one step further: instead of naively selecting as prediction the class with the highest probability, you may apply a user-defined threshold; this setting might be useful, e.g., in cases of imbalanced data.
3. The threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.

Despite the mess in the documentation (which I have discussed elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one as well...