How to match cluster labels to "ground truth" labels in Matlab

Vin*_*ent 5 matlab cluster-analysis weka

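I have searched here and on Google, without success. When clustering in Weka there is a handy option, classes to clusters, which matches the clusters produced by the algorithm (e.g. simple k-means) to the "ground truth" class labels you supply as the class attribute. That way we can see the clustering accuracy (% incorrect).

Now, how can I achieve this in Matlab, i.e. convert my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], so that it uses the same indices as the supplied ground truth label vector, e.g. [2, 2, 2, 3, 1, 3]?

I think it could be based on the cluster centers and the label centers, but I am not sure how to implement it!

Any help will be greatly appreciated.

Vincent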

Vid*_*dar 5

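I stumbled upon a similar problem a few months ago when doing clustering. I did not search very long for built-in solutions (although I am sure they must exist) and ended up writing my own little script for best-matching the labels I found with the ground truth. The code is very crude, but it should get you started.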

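It is based on trying out every possible rearrangement of the labels to see which one fits the truth vector best. This means that, given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] against y to find the best match.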
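To make the enumeration concrete, here is a minimal sketch that lists those six candidate relabelings and counts how many positions of each agree with y. It is only an illustration, not the function itself; the variable names (names, P, idx, candidates, matches) are made up for this example.

```matlab
% Sketch: enumerate every candidate relabeling of a clustering result.
yte = [3 3 2 1];                     % clustering result
y   = [1 1 2 3];                     % ground truth

names = unique(yte);                 % distinct cluster labels
P = perms(unique(y));                % each row: one assignment of class labels to clusters
[~, idx] = ismember(yte, names);     % position of each sample's cluster within names

candidates = P(:, idx);              % row i is yte relabeled with permutation i
% Number of positions where each candidate agrees with the truth vector.
matches = sum(candidates == repmat(y, size(P, 1), 1), 2);

[bestCount, bestRow] = max(matches); % best-fitting relabeling
bestLabels = candidates(bestRow, :);
```

With three clusters there are 3! = 6 rows in candidates; the number of rows grows factorially with the number of clusters.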

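The approach relies on the built-in perms(), which cannot handle more than 10 unique clusters. The code may also be slow for 7-10 unique clusters, because the number of permutations to test grows factorially with the number of clusters.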

function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found
%# labels with true labels. Assumes yte and y are both Nx1 vectors of
%# clustering labels with the same number of distinct labels. Does not
%# support fuzzy clustering.
%#
%# The algorithm tries out all reorderings of the cluster labels,
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits
%# the truth vector best. Since this approach makes use of perms(),
%# the code will not run for unique(yte) greater than 10, and it will slow
%# down significantly for a number of clusters greater than 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    -   Overall accuracy for entire clustering (OA). For
%#                   overall error, use OE = 1 - OA.
%#   true_labels -   Row vector giving the label rearrangement which best
%#                   matches the truth vector (y).
%#   CM          -   Confusion matrix (rows: true classes, columns:
%#                   matched labels). If unique(yte) has 4 elements,
%#                   this is a 4x4 matrix counting the correct
%#                   clusterings and each kind of error.

%# Work on row vectors so that Nx1 and 1xN inputs behave the same.
yte = yte(:)';
y = y(:)';
N = length(y);

cluster_names = unique(yte);
class_names = unique(y);
accuracy = 0;

%# Each row of perm is one candidate assignment of class labels to clusters.
perm = perms(class_names);
[pN, pM] = size(perm);

true_labels = y;

for i = 1:pN
    %# Relabel every cluster according to permutation i.
    flipped_labels = zeros(1,N);
    for cl = 1:pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    %# Keep the relabeling that agrees with the truth most often.
    testAcc = sum(flipped_labels == y)/N;
    if testAcc > accuracy
        accuracy = testAcc;
        true_labels = flipped_labels;
    end
end

%# CM(rc,cc) counts samples of true class rc that were matched to label cc.
CM = zeros(pM,pM);
for rc = 1:pM
    for cc = 1:pM
        CM(rc,cc) = sum((y==class_names(rc)) & (true_labels==class_names(cc)));
    end
end

Example:

[acc, newLabels, CM] = calculateAccuracy([3 2 2 1 2 3]', [1 2 2 3 3 3]')

acc =

0.6667


newLabels =

 1     2     2     3     2     1


CM =

 1     0     0
 0     2     0
 1     1     1
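If the factorial cost of perms() becomes a problem, one possible alternative (a different technique from the brute-force search in calculateAccuracy) is to treat the matching as a linear assignment problem on the cluster-by-class contingency table. The following is only a sketch under two assumptions: that matchpairs is available (MATLAB R2019a or later), and that there are as many clusters as classes. The variable names are illustrative.

```matlab
% Sketch: one-to-one cluster-to-class matching as an assignment problem.
% Assumes MATLAB R2019a+ (matchpairs) and equal numbers of clusters and classes.
yte = [3 2 2 1 2 3]';                % cluster labels (same example as above)
y   = [1 2 2 3 3 3]';                % ground-truth labels

[clusters, ~, ci] = unique(yte);     % ci: cluster index of each sample
[classes,  ~, ki] = unique(y);       % ki: class index of each sample

% Contingency table: C(i,j) = number of samples in cluster i with true class j.
C = accumarray([ci ki], 1, [numel(clusters) numel(classes)]);

% matchpairs minimizes total cost, so convert agreement counts into costs.
cost = max(C(:)) - C;
% A large cost for leaving anything unmatched forces a complete matching.
M = matchpairs(cost, numel(y) + 1);

map = zeros(numel(clusters), 1);
map(M(:,1)) = classes(M(:,2));       % cluster index -> matched class label
matched  = map(ci);                  % relabeled clustering
accuracy = mean(matched == y);       % fraction of samples that agree
```

For the example vectors this should select the same relabeling as the brute-force search. The work here is dominated by building the table and solving one assignment problem, rather than testing every permutation.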