Cer*_*rin 4 python scikit-learn
How do you get sklearn's SGDClassifier to show uncertainty in its predictions?
I'm trying to confirm that SGDClassifier will report a 50% probability for input data that doesn't strictly correspond to either label. However, I'm finding that the classifier is always 100% certain.
I'm testing this with the following script:
from sklearn.linear_model import SGDClassifier

c = SGDClassifier(loss="log")  # renamed to loss="log_loss" in scikit-learn >= 1.1
#c = SGDClassifier(loss="modified_huber")
X = [
    # always -1
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    # always +1
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    # uncertain
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
y = [
    -1,
    -1,
    -1,
    -1,
    +1,
    +1,
    +1,
    +1,
    -1,
    +1,
    -1,
    +1,
    -1,
    +1,
    -1,
    +1,
]
def lookup_prob_class(c, dist):
    # return the highest probability in dist and its class label
    a = sorted(zip(dist, c.classes_))
    best_prob, best_class = a[-1]
    return best_prob, best_class

c.fit(X, y)
probs = c.predict_proba(X)
print('probs:')
for dist, true_value in zip(probs, y):
    prob, value = lookup_prob_class(c, dist)
    print('%.02f' % prob, value, true_value)
As you can see, my training data always associates -1 with the input [1,0,0], +1 with [0,0,1], and a 50/50 split with [0,1,0].
I would therefore expect the result from predict_proba() to be 0.5 for the input [0,1,0]. Instead, it reports a probability of 100%. Why is that, and how can I fix it?
Interestingly, swapping SGDClassifier out for DecisionTreeClassifier or RandomForestClassifier does produce the output I expect.
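(The observation about tree-based models is easy to verify; here is a minimal, self-contained sketch using the same toy data as the question, rebuilt with list arithmetic for brevity:)

```python
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question: feature 0 => always -1,
# feature 2 => always +1, feature 1 => a 50/50 label split.
X = [[1, 0, 0]] * 4 + [[0, 0, 1]] * 4 + [[0, 1, 0]] * 8
y = [-1] * 4 + [+1] * 4 + [-1, +1] * 4

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# All eight [0, 1, 0] samples are feature-identical, so they end up in
# a single leaf holding 4 of each label, and the tree reports exactly 50/50.
print(tree.predict_proba([[0, 1, 0]]))  # -> [[0.5 0.5]]
```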
It does show some uncertainty:
>>> c.predict_proba(X)
array([[ 9.97254333e-01, 2.74566740e-03],
[ 9.97254333e-01, 2.74566740e-03],
[ 9.97254333e-01, 2.74566740e-03],
[ 9.97254333e-01, 2.74566740e-03],
[ 1.61231111e-06, 9.99998388e-01],
[ 1.61231111e-06, 9.99998388e-01],
[ 1.61231111e-06, 9.99998388e-01],
[ 1.61231111e-06, 9.99998388e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01],
[ 1.24171982e-04, 9.99875828e-01]])
If you want the model to be less certain, you have to regularize it more strongly. That is done by tuning the alpha parameter:
>>> c = SGDClassifier(loss="log", alpha=1)
>>> c.fit(X, y)
SGDClassifier(alpha=1, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
loss='log', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5,
random_state=None, shuffle=False, verbose=0, warm_start=False)
>>> c.predict_proba(X)
array([[ 0.58782817, 0.41217183],
[ 0.58782817, 0.41217183],
[ 0.58782817, 0.41217183],
[ 0.58782817, 0.41217183],
[ 0.53000442, 0.46999558],
[ 0.53000442, 0.46999558],
[ 0.53000442, 0.46999558],
[ 0.53000442, 0.46999558],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761],
[ 0.55579239, 0.44420761]])
alpha is a penalty on large feature weights, so the higher alpha is, the less the weights are allowed to grow, the less extreme the linear model's values become, and the closer the logistic probability estimates get to ½. Normally, this parameter is tuned using cross-validation.
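(As a sketch of that cross-validation step, here is one way to search over alpha with GridSearchCV; the grid values are arbitrary, and the try/except handles the loss="log" to loss="log_loss" rename across scikit-learn versions:)

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Same toy data as above.
X = [[1, 0, 0]] * 4 + [[0, 0, 1]] * 4 + [[0, 1, 0]] * 8
y = [-1] * 4 + [+1] * 4 + [-1, +1] * 4

# loss="log" was renamed to "log_loss" in scikit-learn 1.1;
# try the new spelling first and fall back to the old one.
try:
    clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
except ValueError:
    clf = SGDClassifier(loss="log", random_state=0).fit(X, y)

search = GridSearchCV(
    clf,
    param_grid={"alpha": [1e-4, 1e-2, 1.0, 10.0]},
    scoring="neg_log_loss",  # rewards calibrated probabilities, not just accuracy
    cv=4,
)
search.fit(X, y)
print(search.best_params_["alpha"])
```

Scoring with neg_log_loss (rather than accuracy) is what makes the search prefer an alpha that yields honest ~0.5 probabilities on the ambiguous inputs over one that is confidently wrong on half of them.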