jam*_*o00 22 algorithm matlab sse simd
假设我有一个数组
k = [1 2 0 0 5 4 0]
我可以按如下方式计算掩码
m = k > 0 = [1 1 0 0 1 1 0]
仅使用掩码m和以下操作
我可以将k压缩成以下内容
[1 2 5 4]
这是我目前的做法(MATLAB伪代码):
function out = compact( in )
d = in
for i = 1:size(in, 2) %do (# of items in in) passes
m = d > 0
%shift left, pad w/ 0 on right
ml = [m(2:end) 0] % shift
dl = [d(2:end) 0] % shift
%if the data originally has a gap, fill it in w/ the
%left shifted one
use = (m == 0) & (ml == 1) %2 comparison
d = use .* dl + ~use .* d
%zero out elements that have been moved to the left
use_r = [0 use(1:end-1)]
d = d .* ~use_r
end
out = d(1 : size(find(in > 0), 2)) %truncate the end
end
Run Code Online (Sandbox Code Playgroud)
直觉
每次迭代,我们将掩模向左移动并比较掩模.如果我们发现在此移位之后,原来为void(mask [i] = 0)的索引现在有效(mask [i] = 1),我们设置索引以获得左移数据.
题
上述算法具有O(N*(3 shift + 2比较+ AND + add + 3 multiplies)).有没有办法提高效率?
Evg*_*uev 10
在原始的伪代码中没有太多优化.我在这看到几个小的改进:
use = (m == 0) & (ml == 1)可能可能简化为use = ~m & ml,~计为单独的操作,这将是最好使用反相形式:use = m | ~ml,d = ~use .* dl + use .* d,use_r = [1 use(1:end-1)],d = d .*use_r但是有可能发明更好的算法.算法的选择取决于使用的CPU资源:
C++,64位,子集宽度= 8:
typedef unsigned long long ull;
const ull h = 0x8080808080808080;
const ull l = 0x0101010101010101;
const ull end = 0xffffffffffffffff;
// uncompacted bytes
ull x = 0x0100802300887700;
// set hi bit for zero bytes (see D.Knuth, volume 4)
ull m = h & ~(x | ((x|h) - l));
// bitmask for nonzero bytes
m = ~(m | (m - (m>>7)));
// tail zero bytes need no special treatment
m |= (m - 1);
while (m != end)
{
ull tailm = m ^ (m + 1); // bytes to be processed
ull tailx = x & tailm; // get the bytes
tailm |= (tailm << 8); // shift 1 byte at a time
m |= tailm; // all processed bytes are masked
x = (x ^ tailx) | (tailx << 8); // actual byte shift
}
Run Code Online (Sandbox Code Playgroud)
所以你需要弄清楚额外的并行性,转移/混洗开销是否值得这么简单的任务.
for(int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
if(mask[inIdx] == 1) {
out[outIdx] = in[inIdx];
outIdx++;
}
}
Run Code Online (Sandbox Code Playgroud)
如果你想进行并行SIMD路由,你最好的选择是一个SWITCH CASE,其中包含掩码的下4位的所有可能的排列.为什么不是8?因为PSHUFD指令只能在XMMX m128而不是YMMX m256上进行随机播放.
所以你做了16个案例:
因此,每种情况都是最少量的处理(1到2个SIMD指令和1个输出指针添加).case语句的周围循环将处理常量输入指针加法(加4)和MOVDQA加载输入.