Rolling median algorithm in C

AWB*_*AWB 109 c algorithm statistics r median

I am currently working on an algorithm to implement a rolling median filter (analogous to a rolling mean filter) in C. From my search of the literature, there appear to be two reasonably efficient ways to do it. The first is to sort the initial window of values, then perform a binary search to insert the new value (and remove the existing one) at each iteration.
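(For illustration only, a minimal sketch of this first approach, with made-up data and an odd window length, might look like the C++ below; the binary searches are O(log k), but the vector insert/erase still shifts O(k) elements per step.)

#include <algorithm>
#include <cstdio>
#include <vector>

// One step of a rolling median over a window of odd size k.
// `sorted` holds the current window contents in ascending order.
double rolling_median_step(std::vector<double>& sorted, double incoming, double outgoing)
{
    // Remove the value leaving the window (binary search, then erase).
    auto out = std::lower_bound(sorted.begin(), sorted.end(), outgoing);
    sorted.erase(out);

    // Insert the value entering the window at its sorted position.
    auto in = std::upper_bound(sorted.begin(), sorted.end(), incoming);
    sorted.insert(in, incoming);

    return sorted[sorted.size() / 2];  // middle element = median for odd k
}

int main()
{
    const std::size_t k = 5;
    std::vector<double> x = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};

    std::vector<double> window(x.begin(), x.begin() + k);
    std::sort(window.begin(), window.end());
    std::printf("median[0] = %g\n", window[k / 2]);

    for (std::size_t j = k; j < x.size(); ++j) {
        double m = rolling_median_step(window, x[j], x[j - k]);
        std::printf("median[%zu] = %g\n", j - k + 1, m);
    }
    return 0;
}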

The second (from Härdle and Steiger, 1995, JRSS-C, Algorithm 296) builds a double-ended heap structure, with a maxheap on one end, a minheap on the other, and the median in between. This yields a linear-time algorithm instead of an O(n log n) one.
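(The windowed deletion is what makes the Härdle-Steiger structure hard to write. As a rough sketch of just the two-heap balancing idea, here is a growing-stream median in C++; no element ever leaves, so this is not the rolling algorithm itself, only an illustration of the maxheap/minheap split described above.)

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Streaming (growing-window) median: a max-heap holds the lower half,
// a min-heap the upper half. The rolling-window version additionally has
// to delete the element leaving the window, which needs indexed heaps.
class StreamingMedian {
    std::priority_queue<double> lo;  // max-heap: lower half
    std::priority_queue<double, std::vector<double>, std::greater<double>> hi;  // min-heap: upper half
public:
    void insert(double v) {
        if (lo.empty() || v <= lo.top()) lo.push(v); else hi.push(v);
        // Rebalance so that lo has the same size as hi, or one more element.
        if (lo.size() > hi.size() + 1) { hi.push(lo.top()); lo.pop(); }
        else if (hi.size() > lo.size()) { lo.push(hi.top()); hi.pop(); }
    }
    double median() const {
        return lo.size() > hi.size() ? lo.top() : (lo.top() + hi.top()) / 2.0;
    }
};

int main() {
    StreamingMedian sm;
    std::vector<double> xs = {5.0, 2.0, 8.0, 1.0, 9.0, 4.0};
    for (double v : xs) {
        sm.insert(v);
        std::printf("after %g: median = %g\n", v, sm.median());
    }
    return 0;
}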

Here is my problem: implementing the former is doable, but I need to run it on millions of time series, so efficiency matters a lot. The latter is proving very difficult to implement. I found code in the Trunmed.c file of the code for R's stats package, but it is rather indecipherable.

Does anyone know of a well-written C implementation of the linear-time rolling median algorithm?

Edit: link to the Trunmed.c code: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#mYw3h_Lb_e0/R-2.2.0/src/library/stats/src/Trunmed.c

Dir*_*tel 26

I have looked at R's src/library/stats/src/Trunmed.c a few times, as I wanted something similar in a standalone C++ class / C subroutine as well. Note that this is actually one of two implementations; see src/library/stats/man/runmed.Rd (the source of the help file), which says:

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.

Edit 1: Besides the link to the older version of Trunmed.c above, here is the current SVN copy.

Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning, which may be a suitable starting point for the windowed approach.

  • @AWB What ended up happening with this idea? Did you incorporate your solution into a package? (8 upvotes)

Leo*_*adt 17

I could not find a modern C++ implementation of an order-statistic data structure, so I ended up implementing both ideas from the TopCoder link suggested by MAK (Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets, etc.) with O(ln N) per insert/delete, but it does not allow the quantile being tracked to be changed dynamically without a large cost. That is, we can have a rolling median, or a rolling 75th percentile, but not both at the same time.
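(A minimal sketch of this partitioning idea, using two std::multisets and assuming an odd window size; this is not the poster's actual code, just an illustration.)

#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

// Rolling median with two multisets: `lo` holds the smaller half (its
// largest element is the median for odd k), `hi` holds the larger half.
// Insert and erase are O(log k). Tracking a different quantile would mean
// shifting many elements between the two sets, hence the limitation above.
class TwoMultisetMedian {
    std::multiset<double> lo, hi;

    void rebalance() {
        while (lo.size() > hi.size() + 1) {
            hi.insert(*lo.rbegin());
            lo.erase(std::prev(lo.end()));
        }
        while (hi.size() > lo.size()) {
            lo.insert(*hi.begin());
            hi.erase(hi.begin());
        }
    }
public:
    void insert(double v) {
        if (lo.empty() || v <= *lo.rbegin()) lo.insert(v); else hi.insert(v);
        rebalance();
    }
    void erase(double v) {
        // Erase one copy of v (assumed present) from whichever half holds it.
        auto it = lo.find(v);
        if (it != lo.end()) lo.erase(it); else hi.erase(hi.find(v));
        rebalance();
    }
    double median() const { return *lo.rbegin(); }  // assumes odd window size
};

int main() {
    const std::size_t k = 5;
    std::vector<double> x = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
    TwoMultisetMedian m;
    for (std::size_t j = 0; j < x.size(); ++j) {
        m.insert(x[j]);
        if (j >= k) m.erase(x[j - k]);
        if (j + 1 >= k) std::printf("median ending at %zu = %g\n", j, m.median());
    }
    return 0;
}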

Segment tree

The second idea uses a segment tree, which is O(ln N) for insert/delete/query but more flexible. Best of all, the "N" here is the size of your data range. So if your rolling median has a window of a million items but your data only varies over 1..65536, then only 16 operations are required per movement of the rolling window!

The C++ code is similar to the one Denis posted ("Here is a simple algorithm for quantized data").
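(As an illustration of the value-range idea, and not the poster's code: the sketch below uses a Fenwick/binary indexed tree of counts instead of a full segment tree, which gives the same O(log(range)) insert/delete/median behaviour for quantized data.)

#include <cstdio>
#include <vector>

// Rolling median for quantized data in [0, nlevel), using a Fenwick tree of
// counts over the value range: each insert/delete/median costs O(log nlevel),
// e.g. 16 steps for 16-bit data, independent of the window size.
class RangeCountMedian {
    std::vector<int> bit;   // 1-based Fenwick tree of per-value counts
    int nlevel;
public:
    explicit RangeCountMedian(int levels) : bit(levels + 1, 0), nlevel(levels) {}

    void add(int value, int delta) {             // delta = +1 to insert, -1 to delete
        for (int i = value + 1; i <= nlevel; i += i & -i) bit[i] += delta;
    }

    int kth(int k) const {                       // k-th smallest value in the window, k is 1-based
        int pos = 0, logn = 0;
        while ((1 << (logn + 1)) <= nlevel) ++logn;
        for (int step = 1 << logn; step > 0; step >>= 1)
            if (pos + step <= nlevel && bit[pos + step] < k) {
                pos += step;
                k -= bit[pos];
            }
        return pos;                              // 0-based quantization level
    }
};

int main() {
    const int nlevel = 256, window = 5;
    std::vector<int> x = {10, 200, 30, 40, 50, 60, 255, 0, 30, 30};
    RangeCountMedian rcm(nlevel);
    for (std::size_t j = 0; j < x.size(); ++j) {
        rcm.add(x[j], +1);
        if (j >= (std::size_t)window) rcm.add(x[j - window], -1);
        if (j + 1 >= (std::size_t)window)
            std::printf("median ending at %zu = %d\n", j, rcm.kth(window / 2 + 1));
    }
    return 0;
}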

GNU order-statistic trees

Just before giving up, I discovered that stdlibc++ contains order-statistic trees!

These have two critical operations:

iter = tree.find_by_order(value)
order = tree.order_of_key(value)

See the libstdc++ manual, policy_based_data_structures_test (search for "split and join").

I have wrapped the tree in a convenience header for compilers supporting C++0x/C++11 style partial typedefs:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
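(A hedged usage sketch for a rolling median on top of that header, assuming it is saved as gnu_order_statistic_set.h and compiled with GCC and -std=c++11; since the tree is a set, the sketch stores (value, position) pairs so that duplicate values in the window do not collide.)

#include <cstdio>
#include <utility>
#include <vector>
#include "gnu_order_statistic_set.h"   // assumed filename for the header above

int main() {
    using item = std::pair<double, std::size_t>;   // (value, position in stream)
    t_order_statistic_set<item> window;

    const std::size_t k = 5;
    std::vector<double> x = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};

    for (std::size_t j = 0; j < x.size(); ++j) {
        window.insert(item(x[j], j));                      // O(log k)
        if (j >= k) window.erase(item(x[j - k], j - k));   // drop the element leaving the window
        if (j + 1 >= k) {
            // find_by_order(i) returns an iterator to the i-th smallest element (0-based),
            // so the middle index gives the median for odd k.
            double median = window.find_by_order(k / 2)->first;
            std::printf("median ending at %zu = %g\n", j, median);
        }
    }
    return 0;
}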


ASh*_*lly 15

I made a C implementation here. There are some more details in this question: Rolling median in C - Turlach implementation.

Sample usage:

int main(int argc, char* argv[])
{
   int i,v;
   Mediator* m = MediatorNew(15);

   for (i=0;i<30;i++)
   {
      v = rand()&127;
      printf("Inserting %3d \n",v);
      MediatorInsert(m,v);
      v=MediatorMedian(m);
      printf("Median = %3d.\n\n",v);
      ShowTree(m);
   }
}

  • Excellent, fast, and clear implementation based on a min-median-max heap. Very nice work. (6 upvotes)

Tyl*_*ter 9

I use this incremental median estimator:

median += eta * sgn(sample - median)

which has the same form as the more common mean estimator:

mean += eta * (sample - mean)

Here eta is a small learning-rate parameter (e.g. 0.001), and sgn() is the signum function, which returns one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources, use something like eta = 1 / n to converge, where n is the number of samples seen so far.)

In addition, I modified the median estimator so that it works for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be in [0, 1]. This essentially skews the sgn() function's symmetric output {-1, 0, 1} toward one side, partitioning the data samples into two unequally sized bins (fractions p and 1 - p of the data are less than / greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.
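(For concreteness, here is a small self-contained C++ sketch of the estimator in a loop; the constant eta, the uniform test data, and p are arbitrary choices, not part of the original answer.)

#include <cstdio>
#include <cstdlib>

// Signum: returns -1, 0, or +1.
static double sgn(double x) { return (x > 0) - (x < 0); }

int main() {
    const double p = 0.5;      // quantile to track (0.5 = median)
    const double eta = 0.01;   // learning rate; tune for your data scale
    double quantile = 0.0;     // running estimate

    std::srand(42);
    for (int n = 0; n < 100000; ++n) {
        double sample = std::rand() / (double)RAND_MAX;   // uniform(0,1): true median is 0.5
        quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0);
    }
    std::printf("estimated %g-quantile = %g\n", p, quantile);
    return 0;
}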

  • For a similar technique, see this paper on frugal streaming: http://arxiv.org/pdf/1407.1121v1.pdf It can estimate any quantile and adapts to changes in the mean. It only requires storing two values: the last estimate and the last adjustment direction (+1 or -1). The algorithm is easy to implement. I found the error to be within 5% about 97% of the time. (3 upvotes)
  • Very cool. Here is a modification that adjusts 'eta' based on the running mean... (the mean is used as a rough estimate of the median, so it converges to large values at the same rate that it converges to tiny values). That is, eta adjusts itself automatically. http://stackoverflow.com/questions/11482529/median-filter-super-ficient-implementation/15150968#15150968 (2 upvotes)

den*_*nis 8

Here is a simple algorithm for quantized data (months later):

""" median1.py: moving median 1d for quantized, e.g. 8-bit data

Method: cache the median, so that wider windows are faster.
    The code is simple -- no heaps, no trees.

Keywords: median filter, moving median, running median, numpy, scipy

See Perreault + Hebert, Median Filtering in Constant Time, 2007,
    http://nomis80.org/ctmf.html: nice 6-page paper and C code,
    mainly for 2d images

Example:
    y = medians( x, window=window, nlevel=nlevel )
    uses:
    med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
    med.addsub( +, - )  -- see the picture in Perreault
    m = med.median()  -- using cached m, summ

How it works:
    picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
        counts: . 1 . . 1 . 1 .
        sums:   0 1 1 1 2 2 3 3
                        ^ sums[3] < 2 <= sums[4] <=> median 4
        addsub( 0, 1 )  m, summ stay the same
        addsub( 5, 1 )  slide right
        addsub( 5, 6 )  slide left

Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)

See also:
    scipy.signal.medfilt -- runtime roughly ~ window size
    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

"""

from __future__ import division
import numpy as np  # bincount, pad0

__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"


#...............................................................................
class Median1:
    """ moving median 1d for quantized, e.g. 8-bit data """

    def __init__( s, nlevel, window, counts ):
        s.nlevel = nlevel  # >= len(counts)
        s.window = window  # == sum(counts)
        s.half = (window // 2) + 1  # odd or even
        s.setcounts( counts )

    def median( s ):
        """ step up or down until sum cnt to m-1 < half <= sum to m """
        if s.summ - s.cnt[s.m] < s.half <= s.summ:
            return s.m
        j, sumj = s.m, s.summ
        if sumj <= s.half:
            while j < s.nlevel - 1:
                j += 1
                sumj += s.cnt[j]
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        else:
            while j > 0:
                sumj -= s.cnt[j]
                j -= 1
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        s.m, s.summ = j, sumj
        return s.m

    def addsub( s, add, sub ):
        s.cnt[add] += 1
        s.cnt[sub] -= 1
        assert s.cnt[sub] >= 0, (add, sub)
        if add <= s.m:
            s.summ += 1
        if sub <= s.m:
            s.summ -= 1

    def setcounts( s, counts ):
        assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
        if len(counts) < s.nlevel:
            counts = pad0__( counts, s.nlevel )  # numpy array / list
        sumcounts = sum(counts)
        assert sumcounts == s.window, (sumcounts, s.window)
        s.cnt = counts
        s.slowmedian()

    def slowmedian( s ):
        j, sumj = -1, 0
        while sumj < s.half:
            j += 1
            sumj += s.cnt[j]
        s.m, s.summ = j, sumj

    def __str__( s ):
        return ("median %d: " % s.m) + \
            "".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])

#...............................................................................
def medianfilter( x, window, nlevel=256 ):
    """ moving medians, y[j] = median( x[j:j+window] )
        -> a shorter list, len(y) = len(x) - window + 1
    """
    assert len(x) >= window, (len(x), window)
    # np.clip( x, 0, nlevel-1, out=x )
        # cf http://scipy.org/Cookbook/Rebinning
    cnt = np.bincount( x[0:window] )
    med = Median1( nlevel=nlevel, window=window, counts=cnt )
    y = (len(x) - window + 1) * [0]
    y[0] = med.median()
    for j in xrange( len(x) - window ):
        med.addsub( x[j+window], x[j] )
        y[j+1] = med.median()
    return y  # list
    # return np.array( y )

def pad0__( x, tolen ):
    """ pad x with 0 s, numpy array or list """
    n = tolen - len(x)
    if n > 0:
        try:
            x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
        except NameError:
            x += n * [0]
    return x

#...............................................................................
if __name__ == "__main__":
    Len = 10000
    window = 3
    nlevel = 256
    period = 100

    np.set_printoptions( 2, threshold=100, edgeitems=10 )
    # print medians( np.arange(3), 3 )

    sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
        + 1) * (nlevel-1) / 2
    x = np.asarray( sinwave, int )
    print "x:", x
    for window in ( 3, 31, 63, 127, 255 ):
        if window > Len:  continue
        print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
        y = medianfilter( x, window=window, nlevel=nlevel )
        print np.array( y )

# end median1.py