AWB · 109 · tags: c, algorithm, statistics, r, median
I am currently working on an algorithm to implement a rolling median filter (analogous to a rolling mean filter) in C. From my search of the literature, there appear to be two reasonably efficient ways to do it. The first is to sort the initial window of values, then perform a binary search to insert the new value and remove the exiting one on each iteration.
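For concreteness, that first approach can be sketched in a few lines of C++ (a minimal sketch, not tuned code: the binary searches are O(log k), but the vector shifts make each step O(k) overall):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Rolling median by keeping the current window sorted.
// Each step binary-searches out the expiring value and binary-searches
// in the new one. Assumes an odd window size k, so the median is a
// single element.
std::vector<double> rolling_median_sorted(const std::vector<double>& x, size_t k) {
    std::vector<double> window(x.begin(), x.begin() + k);
    std::sort(window.begin(), window.end());
    std::vector<double> out;
    out.push_back(window[k / 2]);
    for (size_t j = k; j < x.size(); ++j) {
        // Drop the oldest sample, insert the newest, keeping the window sorted.
        window.erase(std::lower_bound(window.begin(), window.end(), x[j - k]));
        window.insert(std::lower_bound(window.begin(), window.end(), x[j]), x[j]);
        out.push_back(window[k / 2]);
    }
    return out;
}
```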
The second (from Härdle and Steiger, 1995, JRSS-C, Algorithm 296) builds a double-ended heap structure: a maxheap on one end, a minheap on the other, with the median in between. This yields a linear-time algorithm instead of one that is O(n log n).
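The two-heap core of that idea is easy to write down; what makes Algorithm 296 hard is deleting the sample that leaves the window. Here is a sketch (mine, not Härdle and Steiger's code) of just the streaming insertion/median part, omitting the windowed deletion:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Double-ended heap idea, streaming version: a max-heap `lo` holds the
// lower half of the samples, a min-heap `hi` the upper half, and the
// median sits on top of `lo`. Insertion is O(log n); the windowed
// algorithm additionally deletes the expiring sample, which this
// sketch omits.
struct StreamingMedian {
    std::priority_queue<double> lo;  // max-heap: lower half
    std::priority_queue<double, std::vector<double>, std::greater<double>> hi;  // min-heap: upper half

    void insert(double v) {
        if (lo.empty() || v <= lo.top()) lo.push(v);
        else hi.push(v);
        // Keep lo the same size as hi, or exactly one element bigger.
        if (lo.size() > hi.size() + 1) { hi.push(lo.top()); lo.pop(); }
        else if (hi.size() > lo.size()) { lo.push(hi.top()); hi.pop(); }
    }
    double median() const {          // requires at least one sample
        return lo.size() > hi.size() ? lo.top() : (lo.top() + hi.top()) / 2.0;
    }
};
```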
Here is my problem: implementing the former is feasible, but I need to run this on millions of time series, so efficiency matters a great deal. The latter is proving very difficult to implement. I found code in the Trunmed.c file of the code for R's stats package, but it is rather indecipherable.
Does anyone know of a well-written C implementation of the linear-time rolling median algorithm?
Edit: link to the Trunmed.c code: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#mYw3h_Lb_e0/R-2.2.0/src/library/stats/src/Trunmed.c
Dir*_*tel · 26
I have looked at R's src/library/stats/src/Trunmed.c a few times, since I wanted something similar in a standalone C++ class / C subroutine as well. Note that this is actually one of two implementations; see src/library/stats/man/runmed.Rd (the source of the help file), which states:
\details{
Apart from the end values, the result \code{y = runmed(x, k)} simply has
\code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
efficiently.
The two algorithms are internally entirely different:
\describe{
\item{"Turlach"}{is the Härdle-Steiger
algorithm (see Ref.) as implemented by Berwin Turlach.
A tree algorithm is used, ensuring performance \eqn{O(n \log
k)}{O(n * log(k))} where \code{n <- length(x)} which is
asymptotically optimal.}
\item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
which makes use of median \emph{updating} when one observation
enters and one leaves the smoothing window. While this performs as
\eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
considerably faster for small \eqn{k} or \eqn{n}.}
}
}
It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.
Edit 1: Besides the link to the older version of Trunmed.c above, here is the current SVN copy.
Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning, which may be a suitable starting point for a windowed approach.
Leo*_*adt · 17
I couldn't find a modern implementation of a C++ data structure with order statistics, so I ended up implementing both ideas from the TopCoder link suggested by MAK (Match Editorial: scroll down to FloatingMedian).
The first idea partitions the data into two data structures (heaps, multisets, etc.) with O(ln N) insert/delete, but it does not allow the quantile to be changed dynamically without a large cost. That is, we can have a rolling median, or a rolling 75%, but not both at the same time.
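As an illustration of that first idea (my sketch, not the FloatingMedian code): two std::multisets split around the median, with O(log k) insert/erase. The split point is baked in, which is exactly why the tracked quantile can't be changed cheaply:

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Windowed median with two multisets: `lo` holds the smaller half
// (including the median, for odd window sizes), `hi` the larger half.
// insert/erase are O(log k); tracking a different quantile would mean
// moving the split point, i.e. rebuilding.
struct WindowMedian {
    std::multiset<double> lo, hi;

    void insert(double v) {
        if (lo.empty() || v <= *lo.rbegin()) lo.insert(v);
        else hi.insert(v);
        rebalance();
    }
    void erase(double v) {   // v must currently be in the window
        if (!lo.empty() && v <= *lo.rbegin()) lo.erase(lo.find(v));
        else hi.erase(hi.find(v));
        rebalance();
    }
    double median() const { return *lo.rbegin(); }  // odd window size
private:
    void rebalance() {
        if (lo.size() > hi.size() + 1) {
            hi.insert(*lo.rbegin());
            lo.erase(std::prev(lo.end()));
        } else if (hi.size() > lo.size()) {
            lo.insert(*hi.begin());
            hi.erase(hi.begin());
        }
    }
};
```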
The second idea uses a segment tree which is O(ln N) for insert/delete/query but is more flexible. Best of all, the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window!
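A sketch of that second idea, using a Fenwick (binary indexed) tree over the quantized value range instead of a segment tree (same complexity, less code; names are mine). The k-th-smallest search does exactly one pass over the ~16 levels when the range is 1..65536, independent of the window size:

```cpp
#include <cassert>
#include <vector>

// Fenwick (binary indexed) tree of counts over the quantized value
// range [0, nlevel). add() inserts or removes one sample in
// O(log nlevel); kth() finds the k-th smallest stored value by binary
// lifting, also O(log nlevel) -- independent of the window size.
struct FenwickQuantile {
    std::vector<int> bit;   // 1-based Fenwick array of counts
    int nlevel, log2n;

    explicit FenwickQuantile(int n) : bit(n + 1, 0), nlevel(n), log2n(0) {
        while ((1 << (log2n + 1)) <= nlevel) ++log2n;
    }
    void add(int value, int delta) {        // delta = +1 insert, -1 remove
        for (int i = value + 1; i <= nlevel; i += i & -i) bit[i] += delta;
    }
    int kth(int k) const {                  // k-th smallest, 1-based rank
        int pos = 0;
        for (int step = 1 << log2n; step > 0; step >>= 1)
            if (pos + step <= nlevel && bit[pos + step] < k) {
                pos += step;
                k -= bit[pos];
            }
        return pos;                         // the value, in [0, nlevel)
    }
};
```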
The C++ code is similar to Denis's code below ("Here is a simple algorithm for quantized data").
Just before giving up, I found that libstdc++ contains order-statistic trees!
These have two critical operations:
iter = tree.find_by_order(value)
order = tree.order_of_key(value)
See the libstdc++ manual policy_based_data_structures_test (search for "split and join").
I have wrapped the tree in a convenience header for compilers supporting C++0x/C++11-style partial typedefs:
#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
    T,
    __gnu_pbds::null_type,
    std::less<T>,
    __gnu_pbds::rb_tree_tag,
    // This policy updates nodes' metadata for order statistics.
    __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
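For anyone trying the header, here is a minimal usage sketch (my example, with distinct values). Note that null_type gives set semantics, so duplicate samples collide; storing (value, sequence-number) pairs is the usual workaround for a real rolling window:

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// Same alias as in the header above, repeated so this snippet stands alone.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
    T,
    __gnu_pbds::null_type,
    std::less<T>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

// Median of the current window: the element at rank size()/2
// (assumes an odd number of distinct stored values).
int window_median(const t_order_statistic_set<int>& t) {
    return *t.find_by_order(t.size() / 2);
}
```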
ASh*_*lly · 15
I made a C implementation here. There are a few more details in this question: Rolling median in C - Turlach implementation.
Sample usage:
int main(int argc, char* argv[])
{
   int i, v;
   Mediator* m = MediatorNew(15);

   for (i = 0; i < 30; i++) {
      v = rand() & 127;
      printf("Inserting %3d \n", v);
      MediatorInsert(m, v);
      v = MediatorMedian(m);
      printf("Median = %3d.\n\n", v);
      ShowTree(m);
   }
}
I use this incremental median estimator:
median += eta * sgn(sample - median)
which has the same form as the more common mean estimator:
mean += eta * (sample - mean)
Here eta is a small learning-rate parameter (e.g. 0.001), and sgn() is the signum function returning one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources, use something like eta = 1/n to converge, where n is the number of samples seen so far.)
Also, I modified the median estimator so it works for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0, 1]. This essentially shifts the sgn() function's symmetric output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally sized bins (fractions p and 1 - p of the data are less than/greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.
Here is a simple algorithm for quantized data (months later):
""" median1.py: moving median 1d for quantized, e.g. 8-bit data
Method: cache the median, so that wider windows are faster.
The code is simple -- no heaps, no trees.
Keywords: median filter, moving median, running median, numpy, scipy
See Perreault + Hebert, Median Filtering in Constant Time, 2007,
http://nomis80.org/ctmf.html: nice 6-page paper and C code,
mainly for 2d images
Example:
y = medians( x, window=window, nlevel=nlevel )
uses:
med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
med.addsub( +, - ) -- see the picture in Perreault
m = med.median() -- using cached m, summ
How it works:
picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
counts: . 1 . . 1 . 1 .
sums: 0 1 1 1 2 2 3 3
^ sums[3] < 2 <= sums[4] <=> median 4
addsub( 0, 1 ) m, summ stay the same
addsub( 5, 1 ) slide right
addsub( 5, 6 ) slide left
Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)
See also:
scipy.signal.medfilt -- runtime roughly ~ window size
http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c
"""
from __future__ import division
import numpy as np # bincount, pad0
__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"
#...............................................................................
class Median1:
""" moving median 1d for quantized, e.g. 8-bit data """
def __init__( s, nlevel, window, counts ):
s.nlevel = nlevel # >= len(counts)
s.window = window # == sum(counts)
s.half = (window // 2) + 1 # odd or even
s.setcounts( counts )
def median( s ):
""" step up or down until sum cnt to m-1 < half <= sum to m """
if s.summ - s.cnt[s.m] < s.half <= s.summ:
return s.m
j, sumj = s.m, s.summ
if sumj <= s.half:
while j < s.nlevel - 1:
j += 1
sumj += s.cnt[j]
# print "j sumj:", j, sumj
if sumj - s.cnt[j] < s.half <= sumj: break
else:
while j > 0:
sumj -= s.cnt[j]
j -= 1
# print "j sumj:", j, sumj
if sumj - s.cnt[j] < s.half <= sumj: break
s.m, s.summ = j, sumj
return s.m
def addsub( s, add, sub ):
s.cnt[add] += 1
s.cnt[sub] -= 1
assert s.cnt[sub] >= 0, (add, sub)
if add <= s.m:
s.summ += 1
if sub <= s.m:
s.summ -= 1
def setcounts( s, counts ):
assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
if len(counts) < s.nlevel:
counts = pad0__( counts, s.nlevel ) # numpy array / list
sumcounts = sum(counts)
assert sumcounts == s.window, (sumcounts, s.window)
s.cnt = counts
s.slowmedian()
def slowmedian( s ):
j, sumj = -1, 0
while sumj < s.half:
j += 1
sumj += s.cnt[j]
s.m, s.summ = j, sumj
def __str__( s ):
return ("median %d: " % s.m) + \
"".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])
#...............................................................................
def medianfilter( x, window, nlevel=256 ):
""" moving medians, y[j] = median( x[j:j+window] )
-> a shorter list, len(y) = len(x) - window + 1
"""
assert len(x) >= window, (len(x), window)
# np.clip( x, 0, nlevel-1, out=x )
# cf http://scipy.org/Cookbook/Rebinning
cnt = np.bincount( x[0:window] )
med = Median1( nlevel=nlevel, window=window, counts=cnt )
y = (len(x) - window + 1) * [0]
y[0] = med.median()
for j in xrange( len(x) - window ):
med.addsub( x[j+window], x[j] )
y[j+1] = med.median()
return y # list
# return np.array( y )
def pad0__( x, tolen ):
""" pad x with 0 s, numpy array or list """
n = tolen - len(x)
if n > 0:
try:
x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
except NameError:
x += n * [0]
return x
#...............................................................................
if __name__ == "__main__":
Len = 10000
window = 3
nlevel = 256
period = 100
np.set_printoptions( 2, threshold=100, edgeitems=10 )
# print medians( np.arange(3), 3 )
sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
+ 1) * (nlevel-1) / 2
x = np.asarray( sinwave, int )
print "x:", x
for window in ( 3, 31, 63, 127, 255 ):
if window > Len: continue
print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
y = medianfilter( x, window=window, nlevel=nlevel )
print np.array( y )
# end median1.py