ncl*_*ent 2 algorithm cuda bit-manipulation
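I'm interested in a fast method for "expanding bits", which can be defined as follows:

Let B be a binary number with n bits, and let P be the positions of all the 1-bits in B, i.e. `(1 << p[i]) & B` is nonzero for every p[i] in P, with |P| = k. Given another number A with k bits, let Ap be the bit-expanded form of A given B, such that `Ap[j] == A[j] << p[j]`.

A few examples:

Below is a naive algorithm, but I can't help feeling there is a faster/easier way to do this.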
    __device__ unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
        int k = __popc(B); // CUDA intrinsic; any population-count routine works here
        unsigned int Ap = 0;
        int j = k - 1;
        // Starting at the most significant bit,
        for (int i = n - 1; i >= 0; --i) {
            Ap <<= 1;
            // if bit i of B is 1, add the value of A[j] to Ap and decrement j.
            if (B & (1U << i)) { // 1U avoids signed overflow when i == 31
                Ap += (A >> j--) & 1;
            }
        }
        return Ap;
    }
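As a concrete check of the definition, here is a minimal host-side sketch in plain C++. The name `expand_bits_ref` and the sample values are illustrative assumptions, not part of the question. With B = 0010 1110 the 1-bits sit at positions 1, 2, 3 and 5, so the bits of A = 0110 (read from the least significant end: 0, 1, 1, 0) land at those positions and give 0000 1100.

```cpp
#include <cstdint>

// Host-side reference for the bit expansion described above: walk the mask B
// from its least significant end and hand out the bits of A in order.
// NOTE: expand_bits_ref is a hypothetical name used only for this sketch.
std::uint32_t expand_bits_ref(std::uint32_t A, std::uint32_t B)
{
    std::uint32_t Ap = 0;
    int j = 0;                              // index of the next bit of A to place
    for (int i = 0; i < 32; ++i) {
        if (B & (1u << i)) {                // position i is a 1-bit of the mask
            Ap |= ((A >> j) & 1u) << i;     // deposit A[j] at position i
            ++j;
        }
    }
    return Ap;
}
```

For instance, `expand_bits_ref(0x6, 0x2E)` evaluates to `0x0C`, and a mask of all ones returns A unchanged.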
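The question appears to ask for a CUDA emulation of the BMI2 instruction `PDEP`, which takes a source operand `a` and deposits its bits at the positions given by the 1-bits of a mask `b`. There is no hardware support for an identical or similar operation on currently shipping GPUs, that is, up to and including the Maxwell architecture.

Based on the two examples given, I am assuming that the mask `b` is generally sparse, and that we can minimize work by iterating only over the 1-bits of `b`. This may cause divergent branches on the GPU, but the exact performance trade-off is unknown without knowledge of a specific use case. For now, I am assuming that exploiting the sparsity of the mask `b` has a stronger positive influence on performance than the negative impact of the divergence.

In the emulation code below, I have reduced the use of potentially "expensive" shift operations, relying mostly on simple ALU instructions instead. On various GPUs, shift instructions execute with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution-limited by the arithmetic units. If desired, the expression `1U << i` can be replaced by addition: introduce a variable `m` that is initialized to `1` before the loop and doubled each time through the loop.

The basic idea is to isolate each 1-bit of the mask `b` in turn (starting at the least significant end), AND it with the value of the i-th bit of `a`, and incorporate the result into the expanded destination. After a 1-bit of `b` has been used, we remove it from the mask and iterate until the mask becomes zero.

To avoid shifting the i-th bit of `a` into place, we simply isolate it and then replicate its value into all more significant bits by simple negation, taking advantage of the two's complement representation of integers.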
    /* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
       bit) at the positions indicated by the set bits of the mask stored in 'b'.
    */
    __device__ unsigned int my_pdep (unsigned int a, unsigned int b)
    {
        unsigned int l, s, r = 0;
        int i;
        for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
            l = b & (0 - b); // extract mask's least significant 1-bit
            b = b ^ l; // clear mask's least significant 1-bit
            s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more significant bits
            r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        }
        return r;
    }
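The two bit tricks the loop relies on can be looked at in isolation. The following is a small host-side sketch in plain C++; the helper names `lowest_set_bit` and `spread_upward` are hypothetical and exist only for illustration.

```cpp
#include <cstdint>

// Isolate the least significant 1-bit of x (returns 0 when x == 0).
// Unsigned arithmetic wraps, so 0 - x is the two's complement negation of x.
std::uint32_t lowest_set_bit(std::uint32_t x)
{
    return x & (0u - x);
}

// Given a value with at most one bit set, return a mask in which that bit and
// every more significant bit are set (or 0 if no bit is set).
std::uint32_t spread_upward(std::uint32_t single_bit)
{
    return 0u - single_bit;
}
```

For example, `lowest_set_bit(0x2E)` is `0x02` and `spread_upward(0x04)` is `0xFFFFFFFC`. The i-th 1-bit of the mask (counting from 0) always sits at position i or higher, so ANDing it with the spread value yields either that mask bit (when bit i of `a` is set) or zero, with no shift needed.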
The variant without any shift operations mentioned above looks as follows:
    /* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
       bit) at the positions indicated by the set bits of the mask stored in 'b'.
    */
    __device__ unsigned int my_pdep (unsigned int a, unsigned int b)
    {
        unsigned int l, s, r = 0, m = 1;
        while (b) { // iterate over 1-bits in mask, until mask becomes 0
            l = b & (0 - b); // extract mask's least significant 1-bit
            b = b ^ l; // clear mask's least significant 1-bit
            s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
            r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
            m = m + m; // mask for next bit of 'a'
        }
        return r;
    }
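Before timing anything on a GPU, it can be useful to check the loop on the host. The sketch below assumes a build setup in which the same function is compiled either by nvcc or by an ordinary C++ compiler; the macro `HOST_DEVICE` and the names `pdep_emul`, `pdep_ref` and `count_mismatches` are hypothetical and introduced only for this example.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__   // building with nvcc
#else
#define HOST_DEVICE                       // building with a plain C++ compiler
#endif

// Same loop as the shift-free variant above, made callable from host code.
HOST_DEVICE inline std::uint32_t pdep_emul(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t r = 0, m = 1;
    while (b) {
        std::uint32_t l = b & (0u - b);   // mask's least significant 1-bit
        b ^= l;                           // clear it from the mask
        std::uint32_t s = 0u - (a & m);   // spread current bit of 'a' upward
        r |= l & s;                       // deposit it at the mask position
        m += m;                           // select the next bit of 'a'
    }
    return r;
}

// Straightforward bit-by-bit reference, used only for comparison.
inline std::uint32_t pdep_ref(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t r = 0;
    for (int i = 0, j = 0; i < 32; ++i) {
        if (b & (1u << i)) {
            r |= ((a >> j) & 1u) << i;
            ++j;
        }
    }
    return r;
}

// Count disagreements over all 8-bit (a, b) pairs and over the same byte
// patterns replicated across all 32 bits.
inline int count_mismatches()
{
    int bad = 0;
    for (std::uint32_t a = 0; a < 256; ++a) {
        for (std::uint32_t b = 0; b < 256; ++b) {
            if (pdep_emul(a, b) != pdep_ref(a, b)) ++bad;
            std::uint32_t wa = a * 0x01010101u, wb = b * 0x01010101u;
            if (pdep_emul(wa, wb) != pdep_ref(wa, wb)) ++bad;
        }
    }
    return bad;
}
```

A return value of zero from `count_mismatches()` means the two functions agree on every tested pair. This is a correctness check only and says nothing about GPU throughput or divergence.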
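In the comments below, @Evgeny Kluev pointed to the shift-free `PDEP` emulation on the chess programming website, which looks potentially faster than either of my two implementations above; it seems worth a try.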