Constructing a fixed-interval dataset from a random-interval dataset using stale data

Question

UPDATE: I have added a brief analysis of the three answers at the bottom of the question text and explained my choice.

My question: What is the most efficient way to construct a fixed-interval dataset from a random-interval dataset using stale data?

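Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring at fixed times, say every 5 minutes. Call it Output. One of the most common ways to construct this dataset is to use stale data, i.e. to set each observation in Output equal to the most recently occurring observation in Input.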

So, here is some code to construct example datasets:

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];


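Both datasets start close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals, while the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answer.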
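To make the target concrete, here is a tiny case that can be checked by hand. The timestamps and values are made up for illustration (they are not the example data above), and the names toyIn, toyOutTime and toyOutVal are hypothetical. Each output time should pick up the value of the latest input at or before it:

```matlab
% Toy data (hypothetical values, chosen so the result can be checked by hand)
toyIn      = [1.0 10; 2.5 20; 2.7 30; 6.0 40];   % [timestamp, value]
toyOutTime = (2:6)';                              % fixed grid: 2, 3, 4, 5, 6

% Brute force: for each output time, take the last input at or before it
toyOutVal = NaN(size(toyOutTime));
for k = 1:numel(toyOutTime)
    idx = find(toyIn(:,1) <= toyOutTime(k), 1, 'last');
    toyOutVal(k) = toyIn(idx, 2);
end

disp([toyOutTime, toyOutVal])
```

Note that the value 20 never appears in the result (two inputs arrive between grid points 2 and 3, and only the later one survives), while the value 30 is reused for three consecutive grid points.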

Currently, I solve the problem like this:

sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
    if Input(t, 1) > Output(s, 1)
        %#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
        Output(s, 2:end) = Input(t-1, 2:end);
        s = s + 1;
        %#Check if we've filled out all observations in output
        if s > sMax
            break
        end
        %#This step is necessary in case we need to use the same input observation twice in a row
        t = t - 1;
    end
    t = t + 1;
    if t > tMax
        %#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output 
        Output(s:end, 2:end) = Input(end, 2:end);
        break
    end
end


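Surely there is a more efficient, or at least more elegant, way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some built-in function I am not aware of? Any help would be much appreciated, as I use this routine a lot on some large datasets.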
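One possible built-in route, offered only as a sketch: newer MATLAB releases accept a 'previous' method in interp1 (I believe it arrived around R2014b, so its availability in any given installation is an assumption). The data below is made up, and clamping the query times to the last input timestamp is just one way of handling outputs that fall after the final input. It also relies on the first input preceding the first output and on the input timestamps being distinct.

```matlab
% Hypothetical toy data: [timestamp, value]
In      = [1.0 10; 2.5 20; 2.7 30; 6.0 40];
outTime = (2:8)';            % the last two grid points lie after the final input

% Clamp queries to the last input time so late outputs reuse the last value
q = min(outTime, In(end,1));

% 'previous' returns the value at the nearest sample point at or before each query
outVal = interp1(In(:,1), In(:,2), q, 'previous');

disp([outTime, outVal])
```

Whether this is faster than a hand-written loop on large datasets would need to be timed; the sketch only shows that the stale-data rule can be expressed in one call.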

ANSWER: Hi all, I have analysed the three answers, and as they stand, Angainor's is the best.

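ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside the speed test. I am guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible that Mathworks has improved it since. Also, this method needs an additional line for the case where Output ends more than one observation after Input.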

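Rody's answer runs only slightly slower than Angainor's (unsurprising, given that they both use the histc approach); however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem, which I think stems from passing InputTimeStamp as the first input to histc instead of OutputTimeStamp, as Angainor does. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.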

Angainor's looks robust to everything I have thrown at it, and it is the fastest.

I ran a large number of speed tests with different input specifications; the following numbers are fairly representative:

My naive loop: Elapsed time is 8.579535 seconds.

Angainor: Elapsed time is 0.661756 seconds.

Rody: Elapsed time is 0.913304 seconds.

ChthonicDaemon: Elapsed time is 22.916844 seconds.

I am going with Angainor's solution and marking the question as solved.

Answer 1

ang*_*nor 1

Here is my take on the problem. histc is the way to go:

% find Output timestamps in Input bins
N   = histc(Output(:,1), Input(:,1));

% find counts in the non-empty bins
counts = N(find(N));

% find Input signal value associated with every bin
val = Input(find(N),2);

% now, replicate every entry in val
% as many times as specified in counts
index = zeros(1,sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
index = cumsum(index);
val_rep = val(index);

% finish the signal with last entry from Input, as needed
val_rep(end+1:size(Output,1)) = Input(end,2);

% done
Output(:,2) = val_rep;


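I checked it against your procedure for a few different input models (I changed the number of Output timestamps), and the results were the same. However, I am still not sure I have understood your problem, so let me know if something is wrong here.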
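A shorter variant of the same idea, as a sketch rather than a tested replacement: histc can return a second output giving, for each Output timestamp, the index of the Input interval it falls into (0 when the timestamp is outside the range of the edges). The variable name bin is my own, and the handling of zeros assumes, as the question allows, that the first Input observation precedes the first Output observation, so the only out-of-range timestamps are those after the last Input.

```matlab
% Example data as in the question (grid written so its length is exactly TOutput)
TInput  = 100;
TOutput = 50;
Input   = [730486 + cumsum(0.001 * rand(TInput, 1)), randn(TInput, 1)];
Output  = [730486.002 + (0:TOutput-1)' * 0.001, NaN(TOutput, 1)];

% bin(k) is the index of the Input interval containing Output(k,1),
% or 0 if Output(k,1) lies outside [Input(1,1), Input(end,1)]
[~, bin] = histc(Output(:,1), Input(:,1));

% Assumption: out-of-range outputs can only lie after the last input,
% so they take the last Input observation
bin(bin == 0) = size(Input, 1);

% Stale-data fill: each output takes the value of its most recent input
Output(:,2) = Input(bin, 2);
```

This skips the counts and cumsum bookkeeping, at the cost of relying on the assumption above; an Output timestamp before the first Input would silently receive the last Input value.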
