Fast way to get the mean of rows by subscript

Question

zlo*_*lon 3 performance matlab matrix mean

I have data that can be simulated as follows:

N = 10^6;%10^8;
K = 10^4;%10^6; 

subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;


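In other words, it is a matrix whose last column holds the subscripts. Now I want to compute nanmean() for each subscript. I also want to store the number of rows for each subscript. I have some 'dummy' code for this: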

uniqueSubs = unique(M(:,6));
avM = nan(numel(uniqueSubs),6);
for iSub = 1:numel(uniqueSubs)
    tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
    avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1)];
end


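The problem is that it is too slow. I want it to work for N = 10^8 and K = 10^6 (see the commented-out values in the variable definitions).

How can I compute the per-subscript means faster?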

Answer 1

Edr*_*ric 6

This sounds like a perfect job for findgroups and splitapply.

% Find groups in the final column
G = findgroups(M(:,6));
% function to apply per group
fcn = @(group) [mean(group, 1, 'omitnan'), size(group, 1)];
% Use splitapply to apply fcn to each group in M(:,1:5)
result = splitapply(fcn, M(:, 1:5), G);
% Check
assert(isequaln(result, avM));

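To make the grouping step concrete, here is a minimal sketch on a toy matrix. The name Mtoy and its values are made up for illustration; only findgroups, splitapply and mean are taken from the answer above.

```matlab
% Toy data (hypothetical values): two value columns plus a subscript column
Mtoy = [ 1   10   2
         3   NaN  1
         5   30   2
         7   40   1
         9   50   2 ];

% findgroups maps each distinct subscript to a group number 1..numGroups,
% ordered by the sorted distinct subscripts (returned here as groupIDs)
[G, groupIDs] = findgroups(Mtoy(:,3));

% Per group: column means ignoring NaN, followed by the number of rows
fcn = @(rows) [mean(rows, 1, 'omitnan'), size(rows, 1)];
resultToy = splitapply(fcn, Mtoy(:,1:2), G);

disp(G.');        % 2 1 2 1 2
disp(resultToy);  % row 1: subscript 1 -> [5 40 2]; row 2: subscript 2 -> [5 30 3]
```

Each row of resultToy corresponds to one entry of groupIDs, so the subscript values themselves can be kept alongside the means if they are needed later.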

Answer 2

Adr*_*aan 5

M = sortrows(M,6); % sort the data per subscript
IDX = diff(M(:,6)); % find where the subscript changes
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)]; % add start and end of data
for iSub= 2:numel(tmp)
    % Calculate the mean over just a single subscript, store in iSub-1
    avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
end


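This is about 60 times faster than the original code on my computer. The speed-up mainly comes from pre-sorting the data and then finding all the locations where the subscript changes. That way you do not have to traverse the full array every time to find the rows with the right subscript; each iteration only touches the rows it actually needs. So you compute the mean over roughly 100 rows, instead of first checking 1,000,000 rows to see whether each one is needed in that iteration.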
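The boundary-finding step can be seen on a small, already sorted subscript vector. This is only an illustrative sketch: sortedSubs, changeIdx, edges and counts are hypothetical names, and the values are made up.

```matlab
% Subscripts as they would look after sortrows (hypothetical values)
sortedSubs = [1; 1; 2; 2; 2; 5];

% diff is nonzero exactly where the subscript changes, so find() returns
% the last row of every block except the final one
changeIdx = find(diff(sortedSubs));

% Add the start (0) and the end (last row) of the data
edges = [0; changeIdx; numel(sortedSubs)];

% Block k spans rows edges(k)+1 : edges(k+1); its size is the difference
counts = diff(edges);

disp(edges.');   % 0 2 5 6
disp(counts.');  % 2 3 1
```

Here edges plays the role of tmp in the answer's code, and counts is the row count stored in the last column of avM2.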

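So: the original code checks, for each of the numel(uniqueSubs) subscripts (10,000 here), whether each of the N rows (1,000,000 here) belongs to that subscript, which amounts to 10^10 comparisons. The proposed code sorts the rows (sorting is O(N log N), so on the order of 6,000,000 here when using a base-10 logarithm) and then loops once over the full array, without any additional checks.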
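As a back-of-the-envelope check of those counts for N = 10^6 and K = 10^4, the arithmetic can be written out as below. The variable names are hypothetical, and these are rough operation counts, not timing predictions; constant factors and vectorization are ignored.

```matlab
N = 1e6;   % number of rows
K = 1e4;   % number of distinct subscripts

% Original loop: each of the K iterations compares all N subscripts
naiveComparisons = N * K;        % 1e10

% Sort-based version: order of N*log(N), using a base-10 log as in the text
sortOrder = N * log10(N);        % about 6e6

fprintf('naive: %.0e comparisons, sort: %.0e operations\n', ...
        naiveComparisons, sortOrder);
```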

For completeness, here is the original code alongside my version, which shows that the two give the same result:

N = 10^6;%10^8;
K = 10^4;%10^6; 

subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;

uniqueSubs = unique(M(:,6));
%% zlon's original code
avM = nan(numel(uniqueSubs),7); % add the subscript for comparison later
tic
uniqueSubs = unique(M(:,6));
for iSub = 1:numel(uniqueSubs)
    tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
    avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1) uniqueSubs(iSub)];
end
toc
%%%%% End of zlon's code
avM = sortrows(avM,7); % Sort for comparison

%% Start of Adriaan's code
avM2 = nan(numel(uniqueSubs),6);
tic
M = sortrows(M,6);
IDX = diff(M(:,6));
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)];
for iSub = 2:numel(tmp)
    avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
end
toc %tic/toc should not be used for accurate timing, this is just for order of magnitude
%%%% End of Adriaan's code

all(avM(:,1:6) == avM2) % Do the comparison
% End of script

% Output
Elapsed time is 58.561347 seconds.
Elapsed time is 0.843124 seconds. % ~70 times faster

ans =

  1×6 logical array

   1   1   1   1   1   1 % i.e. the matrices are equal to one another

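As the comment in the script notes, tic/toc only gives an order of magnitude. For a steadier measurement, one option is MATLAB's timeit, which runs a function handle several times. The sketch below times the findgroups/splitapply approach from the other answer; the sizes are deliberately smaller than in the question to keep the run short, and tGrouped is a hypothetical name.

```matlab
% Smaller sizes than in the question, chosen only to keep the run short
N = 1e5;
K = 1e3;
M = [randn(N,5) randi([1 K],N,1)];
M(M < -1.2) = nan;

G   = findgroups(M(:,6));
fcn = @(rows) [mean(rows, 1, 'omitnan'), size(rows, 1)];

% timeit calls the handle repeatedly and returns a typical run time in seconds
tGrouped = timeit(@() splitapply(fcn, M(:,1:5), G));
fprintf('splitapply: %.3f s\n', tGrouped);
```

The same pattern works for the sort-based loop if it is first wrapped in a function, since timeit needs a handle it can call.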
