scikit-learn - 具有置信区间的ROC曲线

use*_*189 19 python confidence-interval roc scikit-learn

我能够得到使用ROC曲线scikit-learnfpr,tpr,thresholds = metrics.roc_curve(y_true,y_pred, pos_label=1),其中y_true基于价值观的名单上我的黄金标准(即0负和1为正的情况下),并且y_pred是分数(例如一个对应列表,0.053497243,0.008521122,0.022781548,0.101885263,0.012913795,0.0,0.042881547[...])

我试图弄清楚如何在该曲线上添加置信区间,但是没有找到任何简单的方法来使用sklearn.

ogr*_*sel 26

您可以引导roc计算(使用原始/ 替换新版本的y_true/ 取样并重新计算每次的新值)并以这种方式估计置信区间.y_predy_truey_predroc_curve

为了将列车测试分裂引起的变化考虑在内,您还可以多次使用ShuffleSplit CV迭代器,在火车分割上拟合模型,y_pred为每个模型生成,从而收集roc_curves 的经验分布,最后计算置信度那些间隔.

编辑:python中的boostrapping

以下是从单个模型的预测中引导ROC AUC分数的示例.我选择重新启动ROC AUC以使其更易于作为Stack Overflow应答,但它可以适用于引导整个曲线:

import numpy as np
from scipy.stats import sem
from sklearn.metrics import roc_auc_score

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

print("Original ROC area: {:0.3f}".format(roc_auc_score(y_true, y_pred)))

n_bootstraps = 1000
rng_seed = 42  # control reproducibility
bootstrapped_scores = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    # bootstrap by sampling with replacement on the prediction indices
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject the sample
        continue

    score = roc_auc_score(y_true[indices], y_pred[indices])
    bootstrapped_scores.append(score)
    print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))
Run Code Online (Sandbox Code Playgroud)

您可以看到我们需要拒绝一些无效的重新采样.然而,对于具有许多预测的真实数据,这是非常罕见的事件,并且不应显着影响置信区间(您可以尝试改变rng_seed以检查).

这是直方图:

import matplotlib.pyplot as plt
plt.hist(bootstrapped_scores, bins=50)
plt.title('Histogram of the bootstrapped ROC AUC scores')
plt.show()
Run Code Online (Sandbox Code Playgroud)

自举ROC AUC分数的直方图

请注意,重新采样的分数在[0-1]范围内被删除,导致最后一个分区中的分数很高.

要获得置信区间,可以对样本进行排序:

sorted_scores = np.array(bootstrapped_scores)
sorted_scores.sort()

# Computing the lower and upper bound of the 90% confidence interval
# You can change the bounds percentiles to 0.025 and 0.975 to get
# a 95% confidence interval instead.
confidence_lower = sorted_scores[int(0.05 * len(sorted_scores))]
confidence_upper = sorted_scores[int(0.95 * len(sorted_scores))]
print("Confidence interval for the score: [{:0.3f} - {:0.3}]".format(
    confidence_lower, confidence_upper))
Run Code Online (Sandbox Code Playgroud)

这使:

Confidence interval for the score: [0.444 - 1.0]
Run Code Online (Sandbox Code Playgroud)

置信区间非常宽,但这可能是我选择预测的结果(9个预测中的3个错误),预测总数非常小.

关于情节的另一个评论:分数被量化(许多空的直方图箱).这是少数预测的结果.可以在分数(或y_pred值)上引入一点高斯噪声以平滑分布并使直方图看起来更好.但是,平滑带宽的选择是棘手的.

最后如前所述,此置信区间特定于您的训练集.为了更好地估计模型类和参数引起的ROC的可变性,您应该进行迭代交叉验证.然而,由于您需要为每个随机列车/测试分组训练新模型,因此这通常要昂贵得多.

  • 使用“numpy.random.random_integer”进行替换采样来实现引导很简单。我编辑了答案。 (2认同)

Rau*_*aul 8

DeLong解决方案[无引导程序]

正如这里的一些建议,一种pROC方法会很好。根据pROC 文档,置信区间是通过DeLong计算的:

DeLong是一种渐近精确的方法,用于评估AUC的不确定性(DeLong等人(1988))。从1.9版开始,pROC使用Sun和Xu(2014)提出的算法,该算法的复杂度为O(N log N),并且始终比自举更快。默认情况下,pROC将尽可能选择DeLong方法。

Yandex数据学校在其公共存储库中实现了快速DeLong实施:

https://github.com/yandexdataschool/roc_comparison

因此,本示例中使用的DeLong实现的全部功劳。因此,这是通过DeLong获得CI的方法:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Nov  6 10:06:52 2018

@author: yandexdataschool

Original Code found in:
https://github.com/yandexdataschool/roc_comparison

updated: Raul Sanchez-Vazquez
"""

import numpy as np
import scipy.stats
from scipy import stats

# AUC comparison adapted from
# https://github.com/Netflix/vmaf/
def compute_midrank(x):
    """Computes midranks.
    Args:
       x - a 1D numpy array
    Returns:
       array of midranks
    """
    J = np.argsort(x)
    Z = x[J]
    N = len(x)
    T = np.zeros(N, dtype=np.float)
    i = 0
    while i < N:
        j = i
        while j < N and Z[j] == Z[i]:
            j += 1
        T[i:j] = 0.5*(i + j - 1)
        i = j
    T2 = np.empty(N, dtype=np.float)
    # Note(kazeevn) +1 is due to Python using 0-based indexing
    # instead of 1-based in the AUC formula in the paper
    T2[J] = T + 1
    return T2


def compute_midrank_weight(x, sample_weight):
    """Computes midranks.
    Args:
       x - a 1D numpy array
    Returns:
       array of midranks
    """
    J = np.argsort(x)
    Z = x[J]
    cumulative_weight = np.cumsum(sample_weight[J])
    N = len(x)
    T = np.zeros(N, dtype=np.float)
    i = 0
    while i < N:
        j = i
        while j < N and Z[j] == Z[i]:
            j += 1
        T[i:j] = cumulative_weight[i:j].mean()
        i = j
    T2 = np.empty(N, dtype=np.float)
    T2[J] = T
    return T2


def fastDeLong(predictions_sorted_transposed, label_1_count, sample_weight):
    if sample_weight is None:
        return fastDeLong_no_weights(predictions_sorted_transposed, label_1_count)
    else:
        return fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight)


def fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight):
    """
    The fast version of DeLong's method for computing the covariance of
    unadjusted AUC.
    Args:
       predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples]
          sorted such as the examples with label "1" are first
    Returns:
       (AUC value, DeLong covariance)
    Reference:
     @article{sun2014fast,
       title={Fast Implementation of DeLong's Algorithm for
              Comparing the Areas Under Correlated Receiver Oerating Characteristic Curves},
       author={Xu Sun and Weichao Xu},
       journal={IEEE Signal Processing Letters},
       volume={21},
       number={11},
       pages={1389--1393},
       year={2014},
       publisher={IEEE}
     }
    """
    # Short variables are named as they are in the paper
    m = label_1_count
    n = predictions_sorted_transposed.shape[1] - m
    positive_examples = predictions_sorted_transposed[:, :m]
    negative_examples = predictions_sorted_transposed[:, m:]
    k = predictions_sorted_transposed.shape[0]

    tx = np.empty([k, m], dtype=np.float)
    ty = np.empty([k, n], dtype=np.float)
    tz = np.empty([k, m + n], dtype=np.float)
    for r in range(k):
        tx[r, :] = compute_midrank_weight(positive_examples[r, :], sample_weight[:m])
        ty[r, :] = compute_midrank_weight(negative_examples[r, :], sample_weight[m:])
        tz[r, :] = compute_midrank_weight(predictions_sorted_transposed[r, :], sample_weight)
    total_positive_weights = sample_weight[:m].sum()
    total_negative_weights = sample_weight[m:].sum()
    pair_weights = np.dot(sample_weight[:m, np.newaxis], sample_weight[np.newaxis, m:])
    total_pair_weights = pair_weights.sum()
    aucs = (sample_weight[:m]*(tz[:, :m] - tx)).sum(axis=1) / total_pair_weights
    v01 = (tz[:, :m] - tx[:, :]) / total_negative_weights
    v10 = 1. - (tz[:, m:] - ty[:, :]) / total_positive_weights
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n
    return aucs, delongcov


def fastDeLong_no_weights(predictions_sorted_transposed, label_1_count):
    """
    The fast version of DeLong's method for computing the covariance of
    unadjusted AUC.
    Args:
       predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples]
          sorted such as the examples with label "1" are first
    Returns:
       (AUC value, DeLong covariance)
    Reference:
     @article{sun2014fast,
       title={Fast Implementation of DeLong's Algorithm for
              Comparing the Areas Under Correlated Receiver Oerating
              Characteristic Curves},
       author={Xu Sun and Weichao Xu},
       journal={IEEE Signal Processing Letters},
       volume={21},
       number={11},
       pages={1389--1393},
       year={2014},
       publisher={IEEE}
     }
    """
    # Short variables are named as they are in the paper
    m = label_1_count
    n = predictions_sorted_transposed.shape[1] - m
    positive_examples = predictions_sorted_transposed[:, :m]
    negative_examples = predictions_sorted_transposed[:, m:]
    k = predictions_sorted_transposed.shape[0]

    tx = np.empty([k, m], dtype=np.float)
    ty = np.empty([k, n], dtype=np.float)
    tz = np.empty([k, m + n], dtype=np.float)
    for r in range(k):
        tx[r, :] = compute_midrank(positive_examples[r, :])
        ty[r, :] = compute_midrank(negative_examples[r, :])
        tz[r, :] = compute_midrank(predictions_sorted_transposed[r, :])
    aucs = tz[:, :m].sum(axis=1) / m / n - float(m + 1.0) / 2.0 / n
    v01 = (tz[:, :m] - tx[:, :]) / n
    v10 = 1.0 - (tz[:, m:] - ty[:, :]) / m
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n
    return aucs, delongcov


def calc_pvalue(aucs, sigma):
    """Computes log(10) of p-values.
    Args:
       aucs: 1D array of AUCs
       sigma: AUC DeLong covariances
    Returns:
       log10(pvalue)
    """
    l = np.array([[1, -1]])
    z = np.abs(np.diff(aucs)) / np.sqrt(np.dot(np.dot(l, sigma), l.T))
    return np.log10(2) + scipy.stats.norm.logsf(z, loc=0, scale=1) / np.log(10)


def compute_ground_truth_statistics(ground_truth, sample_weight):
    assert np.array_equal(np.unique(ground_truth), [0, 1])
    order = (-ground_truth).argsort()
    label_1_count = int(ground_truth.sum())
    if sample_weight is None:
        ordered_sample_weight = None
    else:
        ordered_sample_weight = sample_weight[order]

    return order, label_1_count, ordered_sample_weight


def delong_roc_variance(ground_truth, predictions, sample_weight=None):
    """
    Computes ROC AUC variance for a single set of predictions
    Args:
       ground_truth: np.array of 0 and 1
       predictions: np.array of floats of the probability of being class 1
    """
    order, label_1_count, ordered_sample_weight = compute_ground_truth_statistics(
        ground_truth, sample_weight)
    predictions_sorted_transposed = predictions[np.newaxis, order]
    aucs, delongcov = fastDeLong(predictions_sorted_transposed, label_1_count, ordered_sample_weight)
    assert len(aucs) == 1, "There is a bug in the code, please forward this to the developers"
    return aucs[0], delongcov


alpha = .95
y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

auc, auc_cov = delong_roc_variance(
    y_true,
    y_pred)

auc_std = np.sqrt(auc_cov)
lower_upper_q = np.abs(np.array([0, 1]) - (1 - alpha) / 2)

ci = stats.norm.ppf(
    lower_upper_q,
    loc=auc,
    scale=auc_std)

ci[ci > 1] = 1

print('AUC:', auc)
print('AUC COV:', auc_cov)
print('95% AUC CI:', ci)
Run Code Online (Sandbox Code Playgroud)

输出:

AUC: 0.8
AUC COV: 0.028749999999999998
95% AUC CI: [0.46767194, 1.]
Run Code Online (Sandbox Code Playgroud)

我还检查了此实现是否与pROC从中获得的结果匹配R

library(pROC)

y_true = c(0,    1,    0,    0,    1,    1,    0,    1,    0)
y_pred = c(0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04)

# Build a ROC object and compute the AUC
roc = roc(y_true, y_pred)
roc
Run Code Online (Sandbox Code Playgroud)

输出:

Call:
roc.default(response = y_true, predictor = y_pred)

Data: y_pred in 5 controls (y_true 0) < 4 cases (y_true 1).
Area under the curve: 0.8
Run Code Online (Sandbox Code Playgroud)

然后

# Compute the Confidence Interval
ci(roc)
Run Code Online (Sandbox Code Playgroud)

输出

95% CI: 0.4677-1 (DeLong)
Run Code Online (Sandbox Code Playgroud)

  • 在您的消息之后,我对具有不同操作系统、R/Python 和各种版本的软件包的 5 种不同设置进行了一些更详细的测试。我使用 Jupyter 的一种 Python 设置(链接文件中的#3)似乎给出了与其他设置不同的结果。我没有进一步跟踪它,但我的第一个怀疑是 scipy 版本 1.3.0。以下是包含测试数据和我的测试结果的 csv: https://www101.zippyshare.com/v/V1VO0z08/file.html https://www101.zippyshare.com/v/Nh4q08zM/file.html (2认同)