Find unique rows of a cell array considering all possible permutations on each row

use*_*148 8 performance matlab permutation cell-array

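I have a cell array A of dimension m * k.

I want to keep the rows of A that are unique up to the order of the k cells.

The "tricky" part is "up to the order of the k cells": consider the k cells in the i-th row of A, A(i,:). There could be another row j of A, A(j,:), that is equivalent to A(i,:) up to a re-ordering of its k cells, meaning that, for example, if k=4 it could be that: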

A{i,1}=A{j,2}
A{i,2}=A{j,3}
A{i,3}=A{j,1}
A{i,4}=A{j,4}
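To make the equivalence concrete, here is a minimal sketch of a pairwise check. `rows_equivalent` is a hypothetical helper introduced only for illustration (it is not part of the original post), and the toy rows are made up. It assumes the file is run as a script in R2016b or later, where local functions may follow the script body.

```matlab
% Toy rows following the k=4 pattern above: Aj is a re-ordering of Ai.
Ai = {[1 2], [3 4], [5 6], [7 8]};
Aj = {[5 6], [1 2], [3 4], [7 8]};
Ak = {[1 2], [1 2], [3 4], [7 8]};   % not a re-ordering of Ai

same_ij = rows_equivalent(Ai, Aj);   % true
same_ik = rows_equivalent(Ai, Ak);   % false

function tf = rows_equivalent(r1, r2)
% Hypothetical helper: true if cell row r2 is a re-ordering of cell row r1.
% Each cell of r1 is matched to one not-yet-used cell of r2 via isequal.
tf = false;
r2 = r2(:).';                        % force a row so the masks line up
if numel(r1) ~= numel(r2), return; end
used = false(1, numel(r2));
for a = 1:numel(r1)
    match = cellfun(@(c) isequal(r1{a}, c), r2);
    hit = find(match & ~used, 1);
    if isempty(hit), return; end     % no partner left for this cell
    used(hit) = true;
end
tf = true;
end
```

Note that `isequal` compares values only, so a char and its numeric code compare as equal; the check is only as strict as `isequal` is.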

What I am doing at the moment is:

G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
h=7;
M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A=cell(size(M,1),2);
for p=1:size(M,1)
    A{p,1}=squeeze(M(p,:,:)); 
    left=~ismember(G, A{p,1}, 'rows');
    A{p,2}=G(left,:); 
end

%To find equivalent rows up to order I use a double loop (VERY slow).
indices=[]; 
for j=1:size(A,1)
    if ismember(j,indices)==0 %if we have not already identified j as a duplicate
        for i=1:size(A,1)
            if i~=j
               if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                  &&...
                  (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))
                  indices=[indices;i]; 
               end
            end
        end
    end
end
A(indices,:)=[];

It works, but it is too slow. I am hoping there is something faster that I can use.

Dev*_*-iL 6

I'd like to propose another idea, which has some conceptual resemblance to erfan's. My idea uses hash functions, and specifically the GetMD5 FEX submission.

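The main task is to "reduce" each row of A to a single representative value (such as a character vector) and then find the unique entries of the resulting vector.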
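As a rough illustration of that reduction, the sketch below walks through the three steps on a toy cell array. It deliberately uses the built-in `mat2str` as a stand-in for the hash, which is an assumption made only to keep the example free of external dependencies: `mat2str` handles 2-D numeric, logical and char matrices and prints 15 significant digits by default, so it is not a general replacement for `GetMD5`.

```matlab
% Toy input: rows 1 and 2 hold the same two matrices in swapped order.
A = { [1 2; 3 4], [5 6];
      [5 6],      [1 2; 3 4];
      [1 2; 3 4], [7 8] };

% Step 1: one text key per cell (mat2str stands in for a hash function).
keys = cellfun(@mat2str, A, 'UniformOutput', false);

% Step 2: sort the keys inside each row so the cell order no longer
% matters, then join them into a single key per row.
rowKeys = cell(size(keys, 1), 1);
for r = 1:size(keys, 1)
    rowKeys{r} = strjoin(sort(keys(r, :)), '|');
end

% Step 3: keep the first occurrence of every distinct row key.
[~, ia] = unique(rowKeys, 'stable');
out = A(ia, :);   % rows 1 and 3 of the toy input survive
```

Swapping `@mat2str` for a hash function changes only step 1; the sorting and `unique` steps stay the same.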

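Judging by the benchmark against the other suggestions, my answer does not perform as well as one of the alternatives, but I think it is justified for several reasons: it is completely data-type agnostic (within the limitations of GetMD5; see note 1), the algorithm is very straightforward to understand, it is a drop-in replacement because it operates on A, and the resulting array is exactly equal to the one obtained by the original method. Of course, it requires a compiler to get working and carries a risk of hash collisions (which might affect the result in very rare cases).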
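If compiling a MEX file is not an option, one possible workaround is to hash through the JVM that ships with MATLAB. The sketch below is not the `GetMD5` submission and its digests will not match `GetMD5`'s output; `md5_key` is a hypothetical helper, and it assumes real, full, non-empty numeric input (the types `typecast` accepts) and a MATLAB session started with Java enabled. The class name and size are hashed along with the data so that arrays with the same bytes but a different shape or type get different keys.

```matlab
% Usage on toy data (script form, local function at the end; R2016b+).
k1 = md5_key([1 2; 3 4]);
k2 = md5_key([1 2; 3 4]);
k3 = md5_key([1 3 2 4]);     % same bytes in memory, different size

sameContent = strcmp(k1, k2);   % true
sameShape   = strcmp(k1, k3);   % false, because the size is hashed too

function hex = md5_key(x)
% Hypothetical helper: MD5 hex digest of a real, full numeric array.
bytes = [uint8(class(x)), ...                       % type name
         typecast(double(size(x)), 'uint8'), ...    % dimensions
         typecast(x(:).', 'uint8')];                % raw data bytes
md = java.security.MessageDigest.getInstance('MD5');
digest = typecast(md.digest(bytes), 'uint8');       % Java returns int8
hex = sprintf('%02x', digest);                      % 32 hex characters
end
```

A handle such as `@md5_key` could then be passed to `cellfun` in place of `@GetMD5`, at the cost of the Java call overhead.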

Here are the results of a typical run on my computer, followed by the code:

Original method timing:     8.764601s
Dev-iL's method timing:     0.053672s
erfan's method timing:      0.481716s
rahnema1's method timing:   0.009771s

function q39955559
G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
h=7;
M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A=cell(size(M,1),2);
for p=1:size(M,1)
    A{p,1}=squeeze(M(p,:,:)); 
    left=~ismember(G, A{p,1}, 'rows');
    A{p,2}=G(left,:); 
end

%% Benchmark:
tic
A1 = orig_sort(A);
fprintf(1,'Original method timing:\t\t%fs\n',toc);

tic
A2 = hash_sort(A);
fprintf(1,'Dev-iL''s method timing:\t\t%fs\n',toc);

tic
A3 = erfan_sort(A);
fprintf(1,'erfan''s method timing:\t\t%fs\n',toc);

tic
A4 = rahnema1_sort(G,h);
fprintf(1,'rahnema1''s method timing:\t%fs\n',toc);

assert(isequal(A1,A2))
assert(isequal(A1,A3))
assert(isequal(numel(A1),numel(A4)))  % This is the best test I could come up with...

function out = hash_sort(A)
% Hash the contents:
A_hashed = cellfun(@GetMD5,A,'UniformOutput',false);
% Sort hashes of each row:
A_hashed_sorted = A_hashed;
for ind1 = 1:size(A_hashed,1)
  A_hashed_sorted(ind1,:) = sort(A_hashed(ind1,:));
end
A_hashed_sorted = cellstr(cell2mat(A_hashed_sorted));
% Find unique rows:
[~,ia,~] = unique(A_hashed_sorted,'stable');
% Extract relevant rows of A:
out = A(ia,:);

function A = orig_sort(A)
%To find equivalent rows up to order I use a double loop (VERY slow).
indices=[]; 
for j=1:size(A,1)
    if ismember(j,indices)==0 %if we have not already identified j as a duplicate
        for i=1:size(A,1)
            if i~=j
               if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                  &&...
                  (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))
                  indices=[indices;i]; 
               end
            end
        end
    end
end
A(indices,:)=[];

function C = erfan_sort(A)
STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false);
[~, ~, id] = unique(STR);
IC = sort(reshape(id, [], size(STR, 2)), 2);
[~, col] = unique(IC, 'rows');
C = A(sort(col), :); % 'sort' makes the outputs exactly the same.

function A1 = rahnema1_sort(G,h)
idx = nchoosek(1:size(G,1),h);
%concatenate complements
M = [G(idx(1:size(idx,1)/2,:),:), G(idx(end:-1:size(idx,1)/2+1,:),:)];
%convert to cell so A1 is unique rows of A
A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));
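A single `tic`/`toc` pair, as used in the benchmark above, can be noisy for the faster methods. A hedged alternative is the built-in `timeit`, which calls the function several times and reports the median. The workload below is a made-up stand-in, not one of the functions from this answer.

```matlab
% Hypothetical workload standing in for one of the de-duplication functions.
x = randi(100, 2000, 3);
f = @() unique(sort(x, 2), 'rows');

% timeit expects a handle to a function that takes no inputs.
t = timeit(f);
fprintf('Median time per call: %fs\n', t);
```

The same pattern would apply to, for example, `timeit(@() hash_sort(A))` when run from inside the benchmark function, where `A` and `hash_sort` are in scope.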

1 - If more complicated data types need to be hashed, one can use the DataHash FEX submission instead, which is somewhat slower.

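  • Great! Actually, to be most efficient while still covering the general case, one should combine your `GetMD5` idea with my sorting approach! (2 upvotes)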