Out of memory error when using clusterdata in MATLAB

Hos*_*ein 6 matlab cluster-analysis hierarchical

I am trying to cluster a matrix (size: 20057x2):

T = clusterdata(X,cutoff);

But I get this error:

??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.

Error in ==> pdist at 211
    Y = pdistmex(X',dist,additionalArg);

Error in ==> linkage at 139
       Z = linkagemex(Y,method,pdistArg);

Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);

Error in ==> kmeansTest at 2
T = clusterdata(X,1);

Can anyone help me? I have 4 GB of RAM, but I think the problem comes from somewhere else.

Amr*_*mro 13

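As others have mentioned, hierarchical clustering needs to compute the pairwise distance matrix, which in your case is too large to fit in memory.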
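A rough back-of-the-envelope check of that claim, assuming the default double precision (8 bytes per element) and counting only the vector of n(n-1)/2 pairwise distances that pdist returns:

```matlab
% Estimate the memory needed for the pairwise distances of the 20057x2 matrix.
% Assumption: double precision, 8 bytes per element.
n = 20057;                          % number of rows (observations) in X
numPairs = n*(n-1)/2;               % length of the vector returned by pdist
bytesNeeded = numPairs * 8;         % size of that vector in bytes
fprintf('%d pairs, about %.1f GiB\n', numPairs, bytesNeeded/2^30);
```

That is roughly 1.5 GiB in a single contiguous array, before linkage allocates any working memory of its own. With 4 GB of RAM, and especially in a 32-bit MATLAB session (an assumption, since the question does not say), a free block of that size may simply not be available.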

Try using the K-Means algorithm instead:

numClusters = 4;
T = kmeans(X, numClusters);
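K-means never builds the pairwise distance matrix; it only needs the distances from each point to the K centroids. Since the result depends on the random starting centroids, one option is to run it several times and keep the best solution. A minimal sketch on synthetic data, assuming the Statistics Toolbox is available and MATLAB R2011a or later for `rng` (the data and the choice of 5 replicates are illustrative only):

```matlab
rng(0);                                   % fix the seed so the sketch is repeatable
% two synthetic, well-separated 2-D blobs (illustrative data, not the asker's)
X = [randn(200,2); bsxfun(@plus, randn(200,2), [6 6])];

numClusters = 2;
% 'Replicates' reruns k-means from different random starts and keeps
% the solution with the lowest total within-cluster distance
[T, centroids] = kmeans(X, numClusters, 'Replicates', 5);
```

Here `T` holds one cluster index per row of `X`, and `centroids` is a `numClusters`-by-2 matrix of cluster centers.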

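Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, simply compute its distance to each of the centroids and assign it to the closest one.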

Here is sample code to illustrate the idea above:

%# random data
X = rand(25000, 2);

%# pick a subset
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);

%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3));      %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) );       %# number of clusters found

%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])

%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight

%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end

%# calculate squared Euclidean distance of each instance to all cluster centers
%# (skipping the square root does not change which center is closest)
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);

%#clustIDX( ind(1:SUBSET_SIZE) ) = C;

%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight

[Figures: dendrogram; clusters]

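  • Note that I generate random data as input in the example above, and I also randomly select a subset of that data. So if you work with a specific dataset and always pick the same subset of instances, the result will be deterministic. Keep in mind that you can always try different values for the cutoff and subset size variables until you are satisfied with the result. (2 upvotes)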
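One way to always pick the same subset is to fix the random seed before calling `randperm`. A minimal sketch, assuming MATLAB R2011a or later (where `rng` is available); the seed value 42 is arbitrary and `X` stands in for your own fixed dataset:

```matlab
X = rand(25000, 2);              % placeholder for a fixed, real dataset
SUBSET_SIZE = 1000;

rng(42);                         % arbitrary seed, fixed before sampling
ind1 = randperm(size(X,1));
subset1 = ind1(1:SUBSET_SIZE);

rng(42);                         % same seed again
ind2 = randperm(size(X,1));
subset2 = ind2(1:SUBSET_SIZE);

sameSubset = isequal(subset1, subset2);   % true: identical row indices
```

Because the hierarchical clustering itself involves no randomness, the same subset and the same cutoff give the same cluster assignments on every run.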