如何在Python中编写混淆矩阵?

Arj*_*vio 51 python machine-learning

我在Python中编写了一个混淆矩阵计算代码:

def conf_mat(prob_arr, input_arr):
        # confusion matrix
        conf_arr = [[0, 0], [0, 0]]

        for i in range(len(prob_arr)):
                if int(input_arr[i]) == 1:
                        if float(prob_arr[i]) < 0.5:
                                conf_arr[0][1] = conf_arr[0][1] + 1
                        else:
                                conf_arr[0][0] = conf_arr[0][0] + 1
                elif int(input_arr[i]) == 2:
                        if float(prob_arr[i]) >= 0.5:
                                conf_arr[1][0] = conf_arr[1][0] +1
                        else:
                                conf_arr[1][1] = conf_arr[1][1] +1

        accuracy = float(conf_arr[0][0] + conf_arr[1][1])/(len(input_arr))
Run Code Online (Sandbox Code Playgroud)

prob_arr是我的分类代码返回的数组,示例数组是这样的:

 [1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975, 4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.70943426182688163, 1.0, 1.0, 1.0, 1.0]
Run Code Online (Sandbox Code Playgroud)

input_arr是数据集的原始类标签,它是这样的:

[2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1]
Run Code Online (Sandbox Code Playgroud)

我的代码试图做的是:我得到prob_arr和input_arr,并为每个类(1和2)检查它们是否被错误分类.

但我的代码只适用于两个类.如果我为多个类别的数据运行此代码,则它不起作用.我怎样才能为多个课程制作这个?

例如,对于具有三个类的数据集,它应该返回: [[21,7,3],[3,38,6],[5,4,19]]

scl*_*cls 139

Scikit-Learn提供了一项confusion_matrix功能

from sklearn.metrics import confusion_matrix
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
confusion_matrix(y_actu, y_pred)
Run Code Online (Sandbox Code Playgroud)

输出一个Numpy数组

array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]])
Run Code Online (Sandbox Code Playgroud)

但您也可以使用Pandas创建一个混淆矩阵:

import pandas as pd
y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
Run Code Online (Sandbox Code Playgroud)

你会得到一个(标记清楚的)Pandas DataFrame:

Predicted  0  1  2
Actual
0          3  0  0
1          0  1  2
2          2  1  3
Run Code Online (Sandbox Code Playgroud)

如果你添加margins=True喜欢

df_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
Run Code Online (Sandbox Code Playgroud)

你会得到每行和每列的总和:

Predicted  0  1  2  All
Actual
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
Run Code Online (Sandbox Code Playgroud)

您还可以使用以下方法获取标准化混淆矩阵:

df_conf_norm = df_confusion / df_confusion.sum(axis=1)

Predicted         0         1         2
Actual
0          1.000000  0.000000  0.000000
1          0.000000  0.333333  0.333333
2          0.666667  0.333333  0.500000
Run Code Online (Sandbox Code Playgroud)

你可以用这个confusion_matrix绘制

import matplotlib.pyplot as plt
def plot_confusion_matrix(df_confusion, title='Confusion matrix', cmap=plt.cm.gray_r):
    plt.matshow(df_confusion, cmap=cmap) # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)

plot_confusion_matrix(df_confusion)
Run Code Online (Sandbox Code Playgroud)

情节混淆矩阵

或使用以下方法绘制归一化混淆矩阵:

plot_confusion_matrix(df_conf_norm)  
Run Code Online (Sandbox Code Playgroud)

图混淆矩阵归一化

您可能也对此项目感兴趣https://github.com/pandas-ml/pandas-ml及其Pip包https://pypi.python.org/pypi/pandas_ml

有了这个包,混淆矩阵可以打印漂亮,绘图.您可以对混淆矩阵进行二值化,得到类别统计数据,如TP,TN,FP,FN,ACC,TPR,FPR,FNR,TNR(SPC),LR +,LR-,DOR,PPV,FDR,FOR,NPV和一些整体统计

In [1]: from pandas_ml import ConfusionMatrix
In [2]: y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
In [3]: y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
In [4]: cm = ConfusionMatrix(y_actu, y_pred)
In [5]: cm.print_stats()
Confusion Matrix:

Predicted  0  1  2  __all__
Actual
0          3  0  0        3
1          0  1  2        3
2          2  1  3        6
__all__    5  2  5       12


Overall Statistics:

Accuracy: 0.583333333333
95% CI: (0.27666968568210581, 0.84834777019156982)
No Information Rate: ToDo
P-Value [Acc > NIR]: 0.189264302376
Kappa: 0.354838709677
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                        0          1          2
Population                                    12         12         12
P: Condition positive                          3          3          6
N: Condition negative                          9          9          6
Test outcome positive                          5          2          5
Test outcome negative                          7         10          7
TP: True Positive                              3          1          3
TN: True Negative                              7          8          4
FP: False Positive                             2          1          2
FN: False Negative                             0          2          3
TPR: (Sensitivity, hit rate, recall)           1  0.3333333        0.5
TNR=SPC: (Specificity)                 0.7777778  0.8888889  0.6666667
PPV: Pos Pred Value (Precision)              0.6        0.5        0.6
NPV: Neg Pred Value                            1        0.8  0.5714286
FPR: False-out                         0.2222222  0.1111111  0.3333333
FDR: False Discovery Rate                    0.4        0.5        0.4
FNR: Miss Rate                                 0  0.6666667        0.5
ACC: Accuracy                          0.8333333       0.75  0.5833333
F1 score                                    0.75        0.4  0.5454545
MCC: Matthews correlation coefficient  0.6831301  0.2581989  0.1690309
Informedness                           0.7777778  0.2222222  0.1666667
Markedness                                   0.6        0.3  0.1714286
Prevalence                                  0.25       0.25        0.5
LR+: Positive likelihood ratio               4.5          3        1.5
LR-: Negative likelihood ratio                 0       0.75       0.75
DOR: Diagnostic odds ratio                   inf          4          2
FOR: False omission rate                       0        0.2  0.4285714
Run Code Online (Sandbox Code Playgroud)

我注意到一个名为PyCM的关于混淆矩阵的新Python库已经出局:也许你可以看一下.

  • `df_conf_norm = df_confusion / df_confusion.sum(axis=1)` 没有创建归一化的混淆矩阵:行应该总和为 1。你实际上需要:`df_confusion.values / df_confusion.sum(axis=1)[:,None ]` 虽然这会创建一个 numpy 数组,因为熊猫会在没有 `.values` 的情况下抱怨。请参阅:/sf/ask/1372153121/ (2认同)

bea*_*rdc 13

Scikit-learn(我推荐使用它仍然是)包含在metrics模块中:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 0, 0, 0, 1, 1, 0, 2, 2]
>>> confusion_matrix(y_true, y_pred)
array([[3, 0, 0],
       [1, 1, 1],
       [1, 1, 1]])
Run Code Online (Sandbox Code Playgroud)


cgn*_*utt 12

已经过去了将近十年,但是针对此职位的解决方案(没有sklearn)令人费解并且不必要地冗长。可以使用Python在几行中干净地完成计算混淆矩阵的工作。例如:

import numpy as np

def compute_confusion_matrix(true, pred):
  '''Computes a confusion matrix using numpy for two np.arrays
  true and pred.

  Results are identical (and similar in computation time) to: 
    "from sklearn.metrics import confusion_matrix"

  However, this function avoids the dependency on sklearn.'''

  K = len(np.unique(true)) # Number of classes 
  result = np.zeros((K, K))

  for i in range(len(true)):
    result[true[i]][pred[i]] += 1

  return result
Run Code Online (Sandbox Code Playgroud)

  • 并且你可以使用 `@numba.jit` 使其速度提高 10 倍以上:numpy:每个循环 83 毫秒,numba:每个循环 2.4 毫秒(第一次调用除外) (2认同)
  • @Ali Gröch - 只需使用辅助函数映射标签即可。辅助函数可以像“return dict(zip(range(len(np.unique(labels))), np.unique(labels)))”一样简单 (2认同)

Bol*_*ain 10

如果你不想scikit-学习为你做的工作......

    import numpy
    actual = numpy.array(actual)
    predicted = numpy.array(predicted)

    # calculate the confusion matrix; labels is numpy array of classification labels
    cm = numpy.zeros((len(labels), len(labels)))
    for a, p in zip(actual, predicted):
        cm[a][p] += 1

    # also get the accuracy easily with numpy
    accuracy = (actual == predicted).sum() / float(len(actual))
Run Code Online (Sandbox Code Playgroud)

或者在NLTK中查看更完整的实现.


Gre*_*ick 6

无依赖的多类混淆矩阵

# A Simple Confusion Matrix Implementation
def confusionmatrix(actual, predicted, normalize = False):
    """
    Generate a confusion matrix for multiple classification
    @params:
        actual      - a list of integers or strings for known classes
        predicted   - a list of integers or strings for predicted classes
        normalize   - optional boolean for matrix normalization
    @return:
        matrix      - a 2-dimensional list of pairwise counts
    """
    unique = sorted(set(actual))
    matrix = [[0 for _ in unique] for _ in unique]
    imap   = {key: i for i, key in enumerate(unique)}
    # Generate Confusion Matrix
    for p, a in zip(predicted, actual):
        matrix[imap[p]][imap[a]] += 1
    # Matrix Normalization
    if normalize:
        sigma = sum([sum(matrix[imap[i]]) for i in unique])
        matrix = [row for row in map(lambda i: list(map(lambda j: j / sigma, i)), matrix)]
    return matrix
Run Code Online (Sandbox Code Playgroud)

这里的方法是将actual向量中找到的唯一类配对成一个二维列表。从这里,我们简单地迭代通过压缩actualpredicted使用索引来访问矩阵位置向量和填充计数。

用法

cm = confusionmatrix(
    [1, 1, 2, 0, 1, 1, 2, 0, 0, 1], # actual
    [0, 1, 1, 0, 2, 1, 2, 2, 0, 2]  # predicted
)

# And The Output
print(cm)
[[2, 1, 0], [0, 2, 1], [1, 2, 1]]
Run Code Online (Sandbox Code Playgroud)

注:actual班是沿着列和predicted类是沿行。

# Actual
# 0  1  2
  #  #  #   
[[2, 1, 0], # 0
 [0, 2, 1], # 1  Predicted
 [1, 2, 1]] # 2
Run Code Online (Sandbox Code Playgroud)

类名可以是字符串或整数

cm = confusionmatrix(
    ["B", "B", "C", "A", "B", "B", "C", "A", "A", "B"], # actual
    ["A", "B", "B", "A", "C", "B", "C", "C", "A", "C"]  # predicted
)

# And The Output
print(cm)
[[2, 1, 0], [0, 2, 1], [1, 2, 1]]
Run Code Online (Sandbox Code Playgroud)

您还可以返回具有比例的矩阵(归一化)

cm = confusionmatrix(
    ["B", "B", "C", "A", "B", "B", "C", "A", "A", "B"], # actual
    ["A", "B", "B", "A", "C", "B", "C", "C", "A", "C"], # predicted
    normalize = True
)

# And The Output
print(cm)
[[0.2, 0.1, 0.0], [0.0, 0.2, 0.1], [0.1, 0.2, 0.1]]
Run Code Online (Sandbox Code Playgroud)

更强大的解决方案

自从写这篇文章以来,我已经将我的库实现更新为一个在内部使用混淆矩阵表示来计算统计数据的类,除了漂亮地打印混淆矩阵本身。看到这个要点

示例用法

# Actual & Predicted Classes
actual      = ["A", "B", "C", "C", "B", "C", "C", "B", "A", "A", "B", "A", "B", "C", "A", "B", "C"]
predicted   = ["A", "B", "B", "C", "A", "C", "A", "B", "C", "A", "B", "B", "B", "C", "A", "A", "C"]

# Initialize Performance Class
performance = Performance(actual, predicted)

# Print Confusion Matrix
performance.tabulate()
Run Code Online (Sandbox Code Playgroud)

随着输出:

===================================
        A?      B?      C?

A?      3       2       1

B?      1       4       1

C?      1       0       4

Note: class? = Predicted, class? = Actual
===================================
Run Code Online (Sandbox Code Playgroud)

对于归一化矩阵:

# Print Normalized Confusion Matrix
performance.tabulate(normalized = True)
Run Code Online (Sandbox Code Playgroud)

使用归一化输出:

===================================
        A?      B?      C?

A?      17.65%  11.76%  5.88%

B?      5.88%   23.53%  5.88%

C?      5.88%   0.00%   23.53%

Note: class? = Predicted, class? = Actual
===================================
Run Code Online (Sandbox Code Playgroud)


Tob*_*obu 0

您应该从类映射到混淆矩阵中的一行。

这里的映射很简单:

def row_of_class(classe):
    return {1: 0, 2: 1}[classe]
Run Code Online (Sandbox Code Playgroud)

在循环中,计算expected_rowcorrect_row和增量conf_arr[expected_row][correct_row]。您甚至会拥有比开始时更少的代码。