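ermig1979 has a Simd project that shows how he implemented histograms using an approach similar to the one @Paul-R mentioned, along with SSE2 and AVX2 variants:
Project: https://github.com/ermig1979/Simd
Base file: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdBaseHistogram.cpp
The AVX2 implementation can be seen here: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp
Below is a scalar solution that illustrates the basic principle of building several histograms that are summed at the end: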
void Histogram(const uint8_t * src, size_t width, size_t height, size_t stride,
    uint32_t * histogram)
{
    // Four independent partial histograms, one bin per possible uint8_t value.
    uint32_t histograms[4][HISTOGRAM_SIZE];
    memset(histograms, 0, sizeof(uint32_t)*HISTOGRAM_SIZE*4);
    // Largest multiple of 4 that does not exceed width.
    size_t alignedWidth = Simd::AlignLo(width, 4);
    for(size_t row = 0; row < height; ++row)
    {
        size_t col = 0;
        // Main loop: each of the 4 pixels updates a different histogram.
        for(; col < alignedWidth; col += 4)
        {
            ++histograms[0][src[col + 0]];
            ++histograms[1][src[col + 1]];
            ++histograms[2][src[col + 2]];
            ++histograms[3][src[col + 3]];
        }
        // Tail: remaining (width % 4) pixels of the row.
        for(; col < width; ++col)
            ++histograms[0][src[col + 0]];
        src += stride;
    }
    // Sum the partial histograms into the output.
    for(size_t i = 0; i < HISTOGRAM_SIZE; ++i)
        histogram[i] = histograms[0][i] + histograms[1][i] +
            histograms[2][i] + histograms[3][i];
}
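Unfortunately, histograms are almost impossible to vectorize.
You can, however, optimize the scalar code somewhat. A common trick is to use two histograms and combine them at the end. This lets you overlap the loads, increments, and stores, which hides some of the serial dependencies and their associated latencies. Pseudo code: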
init histogram 1 to all 0s
init histogram 2 to all 0s
loop
    get input value 1
    get input value 2
    load count for value 1 from histogram 1
    load count for value 2 from histogram 2
    increment count for histogram 1
    increment count for histogram 2
    store count for value 1 to histogram 1
    store count for value 2 to histogram 2
until done
combine histogram 1 and histogram 2
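The pseudo code above could be written in C++ roughly as follows. This is only a minimal sketch: the function name histogram2 is made up for illustration, it assumes 8-bit input with 256 bins, and it handles an odd-length input by putting the last value into the first histogram, which the pseudo code does not spell out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from any library): counts byte values using two
// partial histograms that are summed at the end.
std::array<std::uint32_t, 256> histogram2(const std::uint8_t* src, std::size_t size)
{
    std::array<std::uint32_t, 256> h1{}; // zero-initialized
    std::array<std::uint32_t, 256> h2{}; // zero-initialized

    std::size_t i = 0;
    // Process two input values per iteration. The two increments touch
    // different arrays, so they never depend on each other even when both
    // values are equal.
    for (; i + 2 <= size; i += 2)
    {
        const std::uint8_t v1 = src[i];
        const std::uint8_t v2 = src[i + 1];
        ++h1[v1];
        ++h2[v2];
    }
    // Assumption: a leftover value (odd size) goes into the first histogram.
    if (i < size)
        ++h1[src[i]];

    // Combine the two partial histograms.
    std::array<std::uint32_t, 256> result{};
    for (std::size_t bin = 0; bin < 256; ++bin)
        result[bin] = h1[bin] + h2[bin];
    return result;
}
```

Whether this actually beats a single-histogram loop depends on the compiler, the CPU, and the input distribution (long runs of identical values benefit most), so it is worth measuring before adopting it.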