Scikit-learn: How to obtain True Positives, True Negatives, False Positives and False Negatives

Eus*_*una 43 python classification machine-learning scikit-learn supervised-learning

I am new to machine learning and scikit-learn.

My question:

(Please correct any misconception)

I have a BIG JSON dataset, which I retrieve and store in a trainList variable.

I pre-process it in order to be able to work with it.

Once that is done, I start the classification:

  1. I use the k-fold cross-validation method in order to obtain the mean accuracy, and I train a classifier.
  2. I make the predictions and obtain the accuracy and confusion matrix of that fold.
  3. After this, I would like to obtain the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values. I would use these parameters to obtain the sensitivity and specificity, and I would add them and the total number of TPs to an HTML page in order to show a chart with the TPs of each label.

Code:

My current variables:

trainList #It is a list with all the data of my dataset in JSON form
labelList #It is a list with all the labels of my data 

The main part of the method:

#I transform the data from JSON form to a numerical one
X=vec.fit_transform(trainList)

#I scale the matrix (don't know why but without it, it makes an error)
X=preprocessing.scale(X.toarray())

#I generate a KFold in order to make cross validation
kf = KFold(len(X), n_folds=10, indices=True, shuffle=True, random_state=1)

#I start the cross validation
for train_indices, test_indices in kf:
    X_train=[X[ii] for ii in train_indices]
    X_test=[X[ii] for ii in test_indices]
    y_train=[labelList[ii] for ii in train_indices]
    y_test=[labelList[ii] for ii in test_indices]

    #I train the classifier
    trained=qda.fit(X_train,y_train)

    #I make the predictions
    predicted=qda.predict(X_test)

    #I obtain the accuracy of this fold
    ac=accuracy_score(predicted,y_test)

    #I obtain the confusion matrix
    cm=confusion_matrix(y_test, predicted)

    #I should calculate the TP,TN, FP and FN 
    #I don't know how to continue
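As a side note, the `KFold` signature in the question comes from the long-removed `sklearn.cross_validation` module. With modern scikit-learn (0.18+), the same loop can be sketched with `sklearn.model_selection.KFold`, whose `split()` method yields the index pairs. The toy `X`/`y` arrays below are made-up stand-ins for the question's vectorized data and `labelList`:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-ins for the question's feature matrix and labels
X = np.random.RandomState(1).rand(100, 4)
y = np.random.RandomState(2).randint(0, 2, 100)

qda = QuadraticDiscriminantAnalysis()
kf = KFold(n_splits=10, shuffle=True, random_state=1)

for train_indices, test_indices in kf.split(X):
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]

    qda.fit(X_train, y_train)
    predicted = qda.predict(X_test)

    ac = accuracy_score(y_test, predicted)
    # passing labels= guarantees a 2x2 matrix even if a fold lacks a class
    cm = confusion_matrix(y_test, predicted, labels=[0, 1])
```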

小智 89

For the multi-class case, everything you need can be found from the confusion matrix. For example, if your confusion matrix looks like this:

[image: confusion matrix]

Then, per class, what you're looking for can be found like this:

[image: per-class TP/FP/FN/TN overlay]

Using pandas/numpy, you can do this for all classes at once, like so:

FP = confusion_matrix.sum(axis=0) - np.diag(confusion_matrix)  
FN = confusion_matrix.sum(axis=1) - np.diag(confusion_matrix)
TP = np.diag(confusion_matrix)
TN = confusion_matrix.values.sum() - (FP + FN + TP)

# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)

# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)

  • This assumes that you are using a pandas DataFrame instance for the confusion matrix here. If you are using a numpy array, just remove the `.values` part. (5 upvotes)
  • Great answer @lucidv01d — I found myself reusing this code often enough that I wrote a [package](https://github.com/arvkevi/disarray) to access these metrics directly from a pandas DataFrame. Your answer, username and profile page are properly attributed :) (4 upvotes)
  • When I try to compute the value of TN, I get this error: 'numpy.ndarray' object has no attribute 'values'. I am using Python 3. (3 upvotes)
  • This is a superb answer. @lucidc01d could you share how you did this? (3 upvotes)
  • Too bad this wasn't marked as the answer. (2 upvotes)
  • Can anyone explain the difference between this accuracy (ACC) and the accuracy measured by `sklearn.metrics.accuracy_score`? There is a difference between the two results. By explanation, I mean which one is more reliable. (2 upvotes)
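As the first comment notes, the recipe above assumes a pandas DataFrame; if the confusion matrix is a plain NumPy array (as returned by `sklearn.metrics.confusion_matrix`), the only change is dropping `.values`. A sketch with a made-up 3-class matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted
cm = np.array([[13,  0, 0],
               [ 0, 10, 6],
               [ 0,  0, 9]])

FP = cm.sum(axis=0) - np.diag(cm)  # predicted as class k but actually another class
FN = cm.sum(axis=1) - np.diag(cm)  # actually class k but predicted as another class
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)     # everything not in class k's row or column

TPR = TP / (TP + FN)  # per-class sensitivity (recall)
TNR = TN / (TN + FP)  # per-class specificity
```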

inv*_*ell 24

If you have two lists with the predicted and actual values, as you appear to, you can pass them to a function that will calculate TP, FP, TN and FN, like the following:

def perf_measure(y_actual, y_hat):
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    for i in range(len(y_hat)): 
        if y_actual[i]==y_hat[i]==1:
           TP += 1
        if y_hat[i]==1 and y_actual[i]!=y_hat[i]:
           FP += 1
        if y_actual[i]==y_hat[i]==0:
           TN += 1
        if y_hat[i]==0 and y_actual[i]!=y_hat[i]:
           FN += 1

    return(TP, FP, TN, FN)

From here on, I think you will be able to calculate the rates of interest to you, as well as other performance measures like specificity and sensitivity.

  • All the errors should be fixed now. (3 upvotes)
  • There seems to be an error in this code. (2 upvotes)
  • `y_actual != y_hat[i]` should have an index; it should be `y_actual[i] != y_hat[i]`. (2 upvotes)
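A quick usage sketch of this approach, with made-up labels (the function is repeated here in condensed form so the snippet runs standalone):

```python
def perf_measure(y_actual, y_hat):
    # same counting logic as the answer above, condensed
    TP = FP = TN = FN = 0
    for actual, predicted in zip(y_actual, y_hat):
        if actual == predicted == 1:
            TP += 1
        elif predicted == 1:
            FP += 1
        elif actual == predicted == 0:
            TN += 1
        else:
            FN += 1
    return TP, FP, TN, FN

y_actual = [1, 1, 0, 0, 1, 0]
y_hat    = [1, 0, 0, 1, 1, 0]
print(perf_measure(y_actual, y_hat))  # (2, 1, 2, 1)
```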

gru*_*gly 22

According to the scikit-learn documentation,

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

By definition, a confusion matrix C is such that C[i,j] is equal to the number of observations known to be in group i but predicted to be in group j.

Thus in binary classification, the count of true negatives is C[0,0], false negatives is C[1,0], true positives is C[1,1] and false positives is C[0,1].

CM = confusion_matrix(y_true, y_pred)

TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]
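Since the question ultimately wants sensitivity and specificity, they follow directly from these four counts; a short sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0, 1, 0]  # hypothetical actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical predictions

CM = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = CM[0][0], CM[0][1], CM[1][0], CM[1][1]

sensitivity = TP / (TP + FN)  # true positive rate
specificity = TN / (TN + FP)  # true negative rate
print(sensitivity, specificity)  # 0.75 0.75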


Aks*_*rit 18

You can obtain all of the parameters from the confusion matrix. For binary classification, the structure of scikit-learn's confusion matrix (a 2x2 matrix, rows being the actual class and columns the predicted class) is as follows:

TN|FP
FN|TP

So

TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

For more details, visit https://en.wikipedia.org/wiki/Confusion_matrix

  • For this, if you look at the Wikipedia link, there is an example involving cats, dogs and horses. The concepts of true positive, true negative etc. make more sense to me when there are two classes, i.e. positive and negative. For your case, I'm not sure what TP and FP mean. You could perhaps take TP as the sum of the diagonal elements, but I'm not sure. You could assume one class as positive and all the others as negative to compute TP, FP etc. for it, but I'm not sure. (2 upvotes)
  • Are you sure you haven't switched the FP and FN positions? I thought it should be `[TP, FN], [FP, TN]`. This is also what is shown on the Wikipedia page. (2 upvotes)
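The ordering is easy to check empirically: for scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted ones, so the binary layout is `[[TN, FP], [FN, TP]]`. A quick check with counts chosen to make the four cells distinct:

```python
from sklearn.metrics import confusion_matrix

# 2 true negatives, 1 false positive, 1 false negative, 4 true positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())  # [[2, 1], [1, 4]] -> [[TN, FP], [FN, TP]]
```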

Jar*_*rno 7

To get the true positives etc. out of the confusion matrix in a one-liner, ravel it:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]   

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 1 1


小智 6

In the scikit-learn 'metrics' library there is a confusion_matrix method which gives you the desired output.

You can use any classifier that you want. Here I used the KNeighbors example.

from sklearn import metrics, neighbors

clf = neighbors.KNeighborsClassifier()

X_test = ...
y_test = ...

expected = y_test
predicted = clf.predict(X_test)

conf_matrix = metrics.confusion_matrix(expected, predicted)

>>> print(conf_matrix)
[[1403   87]
 [  56 3159]]

docs: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

  • The most concise answer so far. (2 upvotes)
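The snippet above elides loading the data and fitting the classifier; a complete runnable sketch along the same lines (the toy dataset choice is illustrative, and the counts in the printed matrix will differ from the one shown above):

```python
from sklearn import datasets, metrics, neighbors
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)  # the classifier must be fitted before predicting

predicted = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test, predicted)
print(conf_matrix)
```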

小智 5

I wrote a version that uses only numpy. I hope it helps you.

import numpy as np

def perf_metrics_2X2(yobs, yhat):
    """
    Returns the specificity, sensitivity, positive predictive value, and 
    negative predictive value 
    of a 2X2 table.

    where:
    0 = negative case
    1 = positive case

    Parameters
    ----------
    yobs :  array of positive and negative ``observed`` cases
    yhat : array of positive and negative ``predicted`` cases

    Returns
    -------
    sensitivity  = TP / (TP+FN)
    specificity  = TN / (TN+FP)
    pos_pred_val = TP/ (TP+FP)
    neg_pred_val = TN/ (TN+FN)

    Author: Julio Cardenas-Rodriguez
    """
    TP = np.sum(  yobs[yobs==1] == yhat[yobs==1] )
    TN = np.sum(  yobs[yobs==0] == yhat[yobs==0] )
    FP = np.sum(  yobs[yobs==0] != yhat[yobs==0] )
    FN = np.sum(  yobs[yobs==1] != yhat[yobs==1] )

    sensitivity  = TP / (TP+FN)
    specificity  = TN / (TN+FP)
    pos_pred_val = TP/ (TP+FP)
    neg_pred_val = TN/ (TN+FN)

    return sensitivity, specificity, pos_pred_val, neg_pred_val
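An equivalent formulation uses boolean masks over the whole arrays, which some may find easier to read; a standalone sketch with made-up arrays:

```python
import numpy as np

yobs = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # hypothetical observed labels
yhat = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # hypothetical predictions

TP = np.sum((yobs == 1) & (yhat == 1))  # observed positive, predicted positive
TN = np.sum((yobs == 0) & (yhat == 0))  # observed negative, predicted negative
FP = np.sum((yobs == 0) & (yhat == 1))  # observed negative, predicted positive
FN = np.sum((yobs == 1) & (yhat == 0))  # observed positive, predicted negative

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
print(TP, TN, FP, FN)  # 3 3 1 1
```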


小智 5

In case someone is looking for the same thing in a MULTI-CLASS example:

def perf_measure(y_actual, y_pred):
    # sorted() makes the per-class output order deterministic
    class_id = sorted(set(y_actual).union(set(y_pred)))
    TP = []
    FP = []
    TN = []
    FN = []

    for index, _id in enumerate(class_id):
        TP.append(0)
        FP.append(0)
        TN.append(0)
        FN.append(0)
        for i in range(len(y_pred)):
            if y_actual[i] == _id and y_pred[i] == _id:
                TP[index] += 1
            if y_actual[i] != _id and y_pred[i] == _id:
                FP[index] += 1
            if y_actual[i] != _id and y_pred[i] != _id:
                TN[index] += 1
            if y_actual[i] == _id and y_pred[i] != _id:
                FN[index] += 1

    return class_id, TP, FP, TN, FN
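The per-class counts produced by such a loop can be cross-checked against the vectorized confusion-matrix recipe from the top answer; a sketch with made-up multi-class labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']  # hypothetical labels
y_pred   = ['cat', 'cat', 'cat', 'bird', 'dog', 'bird']

labels = sorted(set(y_actual) | set(y_pred))  # ['bird', 'cat', 'dog']
cm = confusion_matrix(y_actual, y_pred, labels=labels)

TP = np.diag(cm)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
TN = cm.sum() - (TP + FP + FN)
print(labels, TP.tolist(), FP.tolist(), TN.tolist(), FN.tolist())
```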


小智 5

In scikit-learn version 0.22, you can do it like this:

from sklearn.metrics import multilabel_confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

mcm = multilabel_confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

tn = mcm[:, 0, 0]
tp = mcm[:, 1, 1]
fn = mcm[:, 1, 0]
fp = mcm[:, 0, 1]
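From those per-class arrays, per-class sensitivity and specificity follow directly; the example above is repeated here so the snippet runs standalone:

```python
from sklearn.metrics import multilabel_confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

# one 2x2 matrix per label, laid out [[tn, fp], [fn, tp]]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]

sensitivity = tp / (tp + fn)  # per-class recall, order: ant, bird, cat
specificity = tn / (tn + fp)
```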