lem*_*mon 5 python machine-learning neural-network scikit-learn
My neural network outputs a table of predicted class probabilities for multi-label classification:
print(probabilities)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|--------------|--------------|-----|--------------|--------------|--------------|
| 0 | 2.442745e-05 | 5.952136e-06 | ... | 4.254002e-06 | 1.894523e-05 | 1.033957e-05 |
| 1 | 7.685694e-05 | 3.252202e-06 | ... | 3.617730e-06 | 1.613792e-05 | 7.356643e-06 |
| 2 | 2.296657e-06 | 4.859554e-06 | ... | 9.934525e-06 | 9.244772e-06 | 1.377618e-05 |
| 3 | 5.163169e-04 | 1.044035e-04 | ... | 1.435158e-04 | 2.807420e-04 | 2.346930e-04 |
| 4 | 2.484626e-06 | 2.074290e-06 | ... | 9.958628e-06 | 6.002510e-06 | 8.434519e-06 |
| 5 | 1.297477e-03 | 2.211737e-04 | ... | 1.881772e-04 | 3.171079e-04 | 3.228884e-04 |
I convert it to class labels with a threshold (0.2) so that I can measure the accuracy of my predictions:
predictions = (probabilities > 0.2).astype(int)
print(predictions)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 1 | 0 | 0 | ... | 0 | 0 | 0 |
| 2 | 0 | 0 | ... | 0 | 0 | 0 |
| 3 | 0 | 0 | ... | 0 | 0 | 0 |
| 4 | 0 | 0 | ... | 0 | 0 | 0 |
| 5 | 0 | 0 | ... | 0 | 0 | 0 |
I also have a test set:
print(Y_test)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 1 | 0 | 0 | ... | 0 | 0 | 0 |
| 2 | 0 | 0 | ... | 0 | 0 | 0 |
| 3 | 0 | 0 | ... | 0 | 0 | 0 |
| 4 | 0 | 0 | ... | 0 | 0 | 0 |
| 5 | 0 | 0 | ... | 0 | 0 | 0 |
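Question: how can I build an algorithm in Python that selects the best threshold, i.e. the one that maximizes roc_auc_score(average = 'micro') or another metric?
Perhaps a hand-written Python function could optimize the threshold for a chosen accuracy metric.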
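I assume your ground-truth labels are Y_test and your predictions are predictions.
Optimizing roc_auc_score(average = 'micro') with respect to the prediction threshold does not seem to make sense, because AUC is computed from how the predictions are ranked and therefore needs predictions as float values in [0,1].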
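To illustrate the point about ranking, here is a minimal sketch on made-up data (the arrays `y_true` and `scores` are assumptions for illustration, not taken from the question). It shows that the AUC of the raw scores involves no threshold and is unchanged by a strictly increasing transform, whereas the "AUC" of thresholded 0/1 labels depends on the threshold chosen.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multi-label ground truth (made up): 6 samples, 3 labels.
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
])

# Hypothetical scores in [0, 1): noisy versions of the labels.
rng = np.random.default_rng(0)
scores = 0.3 * y_true + 0.7 * rng.random(y_true.shape)

# AUC on the raw scores: no threshold is involved.
auc_raw = roc_auc_score(y_true, scores, average="micro")

# Cubing is strictly increasing on [0, 1), so the ranking is unchanged.
auc_monotone = roc_auc_score(y_true, scores ** 3, average="micro")

# Thresholding first reduces the scores to two levels, so the result
# now varies with the threshold.
auc_hard = {
    thr: roc_auc_score(y_true, (scores > thr).astype(int), average="micro")
    for thr in (0.2, 0.5, 0.8)
}

print(auc_raw, auc_monotone)
print(auc_hard)
```

The first two printed values should agree, which is why a threshold search is only meaningful for metrics computed on hard labels.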
Therefore, I will discuss accuracy_score.
You can use scipy.optimize.fmin:
import numpy as np
import scipy.optimize
from sklearn.metrics import accuracy_score

def thr_to_accuracy(thr, Y_test, predictions):
    # fmin minimizes, so return the negative accuracy
    return -accuracy_score(Y_test, np.array(predictions > thr, dtype=int))

best_thr = scipy.optimize.fmin(thr_to_accuracy, args=(Y_test, predictions), x0=0.5)
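One caveat: accuracy only changes when the threshold crosses one of the predicted probabilities, so the objective is piecewise constant and a simplex search started at 0.5 may see no improvement nearby. A plain grid search is a simple alternative. The sketch below assumes toy data and a hypothetical helper named `best_global_threshold`; neither comes from the answer above.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_global_threshold(y_true, probabilities, metric, grid=None):
    """Return (threshold, score) maximizing metric(y_true, hard_labels) on a grid."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    probabilities = np.asarray(probabilities)
    best_thr, best_score = None, -np.inf
    for thr in grid:
        labels = (probabilities > thr).astype(int)
        score = metric(y_true, labels)
        # Strict ">" keeps the smallest threshold among ties.
        if score > best_score:
            best_thr, best_score = float(thr), score
    return best_thr, best_score

def micro_f1(y, y_hat):
    # Any metric that takes (y_true, hard_labels) can be plugged in here.
    return f1_score(y, y_hat, average="micro", zero_division=0)

# Toy multi-label data (made up): 4 samples, 3 labels.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
probabilities = np.array([
    [0.90, 0.20, 0.10],
    [0.30, 0.70, 0.20],
    [0.60, 0.80, 0.10],
    [0.10, 0.30, 0.65],
])

thr, score = best_global_threshold(y_true, probabilities, micro_f1)
print(thr, score)
```

The grid resolution is a trade-off: a finer grid costs more metric evaluations but cannot miss a plateau wider than its step.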
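Following @cangrejo's answer (https://stats.stackexchange.com/a/310956/194535): suppose the raw output probabilities of your model form the vector v. You can then define a prior distribution:

π = (1/θ1, 1/θ2, ..., 1/θN), for θi ∈ (0,1) and Σθi = 1, where N is the total number of label classes and i is the class index.

Take v' = v⊙π as the new output probabilities of the model, where ⊙ denotes the element-wise product.

Now your problem can be restated as: find the π that optimizes the metric you specify (e.g. roc_auc_score) for the new output probabilities. Once it is found, the θs (θ1, θ2, ..., θN) are the best thresholds for each class.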
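As a small numeric sketch of the reweighting v' = v⊙π described above (the values of `v` and `theta` are made up for illustration, and the renormalization step is added so that each row of v' sums to 1 again):

```python
import numpy as np

# Hypothetical model output: 2 samples over N = 3 classes, rows sum to 1.
v = np.array([
    [0.50, 0.30, 0.20],
    [0.10, 0.50, 0.40],
])

# Hypothetical per-class thresholds theta_i in (0, 1) with sum 1.
theta = np.array([0.6, 0.3, 0.1])

# pi = (1/theta_1, ..., 1/theta_N)
pi = 1.0 / theta

# Element-wise product, then renormalize each row.
v_new = v * pi
v_new = v_new / v_new.sum(axis=1, keepdims=True)

pred_before = v.argmax(axis=1)      # classes chosen from the raw output
pred_after = v_new.argmax(axis=1)   # classes chosen after reweighting

print(pred_before, pred_after)
```

A small θi makes class i easier to predict, because its probability is divided by a small number before the argmax is taken.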
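The code:

Create a proxyModel class that takes the original model object as an argument and returns a proxyModel object. When you call predict_proba() through the proxyModel object, it automatically computes the new probabilities based on the thresholds you specify: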
import numpy as np

class proxyModel():
    def __init__(self, origin_model):
        self.origin_model = origin_model

    def predict_proba(self, x, threshold_list=None):
        # get origin probability
        ori_proba = self.origin_model.predict_proba(x)

        # set default threshold
        if threshold_list is None:
            threshold_list = np.full(ori_proba[0].shape, 1)

        # get the output shape of threshold_list
        output_shape = np.array(threshold_list).shape

        # element-wise divide by the threshold of each class
        new_proba = np.divide(ori_proba, threshold_list)

        # calculate the norm (sum of the new probabilities over the classes)
        norm = np.linalg.norm(new_proba, ord=1, axis=1)

        # reshape the norm
        norm = np.broadcast_to(np.array([norm]).T, (norm.shape[0], output_shape[0]))

        # renormalize the new probability
        new_proba = np.divide(new_proba, norm)

        return new_proba

    def predict(self, x, threshold_list=None):
        return np.argmax(self.predict_proba(x, threshold_list), axis=1)
Implement the scoring function:
def scoreFunc(model, X, y_true, threshold_list):
    y_pred = model.predict(X, threshold_list=threshold_list)
    y_pred_proba = model.predict_proba(X, threshold_list=threshold_list)

    ###### metrics ######
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import f1_score

    accuracy = accuracy_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred_proba, average='macro')
    pr_auc = average_precision_score(y_true, y_pred_proba, average='macro')
    f1_value = f1_score(y_true, y_pred, average='macro')

    return accuracy, roc_auc, pr_auc, f1_value
Define the weighted_score_with_threshold() function, which takes the thresholds as input and returns a weighted score:
def weighted_score_with_threshold(threshold, model, X_test, Y_test, metrics='accuracy', delta=5e-5):
    # if the sum of thresholds is not between 1-delta and 1+delta,
    # return infinity (just to reduce the search space of the minimization algorithm,
    # because the sum of thresholds should be as close to 1 as possible).
    threshold_sum = np.sum(threshold)

    if threshold_sum > 1+delta:
        return np.inf

    if threshold_sum < 1-delta:
        return np.inf

    # to keep the objective function from jumping into a nan solution
    if np.isnan(threshold_sum):
        print("threshold_sum is nan")
        return np.inf

    # renormalize: the sum of thresholds should be 1
    normalized_threshold = threshold/threshold_sum

    # calculate scores based on thresholds
    # suppose it'll return 4 scores in a tuple: (accuracy, roc_auc, pr_auc, f1)
    scores = scoreFunc(model, X_test, Y_test, threshold_list=normalized_threshold)

    scores = np.array(scores)
    weight = np.array([1,1,1,1])

    # Give the metric you want to maximize a bigger weight:
    if metrics == 'accuracy':
        weight = np.array([10,1,1,1])
    elif metrics == 'roc_auc':
        weight = np.array([1,10,1,1])
    elif metrics == 'pr_auc':
        weight = np.array([1,1,10,1])
    elif metrics == 'f1':
        weight = np.array([1,1,1,10])
    elif metrics == 'all':
        weight = np.array([1,1,1,1])

    # return the negative weighted sum (because you want to maximize the sum,
    # which is equivalent to minimizing the negative sum)
    return -np.dot(weight, scores)
Use the optimization algorithm differential_evolution() (better than fmin) to find the best thresholds:
import numpy as np
from scipy import optimize

output_class_num = Y_test.shape[1]
bounds = optimize.Bounds([1e-5]*output_class_num, [1]*output_class_num)

pmodel = proxyModel(model)

result = optimize.differential_evolution(weighted_score_with_threshold, bounds, args=(pmodel, X_test, Y_test, 'accuracy'))

# calculate threshold
threshold = result.x/np.sum(result.x)

# print the optimized score (use the proxy model, which accepts threshold_list)
print(scoreFunc(pmodel, X_test, Y_test, threshold_list=threshold))