ron*_*nag 12 c++ optimization 64-bit bit-manipulation
在我的代码中,以下行目前是热点:
int table1[256] = /*...*/;
int table2[512] = /*...*/;
int table3[512] = /*...*/;
int* result = /*...*/;
for(int r = 0; r < r_end; ++r)
{
std::uint64_t bits = bit_reader.value(); // 64 bits, no assumption regarding bits.
// The get_ functions are table lookups from the highest word of the bits variable.
struct entry
{
int sign_offset : 5;
int r_offset : 4;
int x : 7;
};
// NOTE: We are only interested in the highest word in the bits variable.
entry e;
if(is_in_table1(bits)) // branch prediction should work well here since table1 will be hit more often than 2 or 3, and 2 more often than 3.
e = reinterpret_cast<const entry&>(table1[get_table1_index(bits)]);
else if(is_in_table2(bits))
e = reinterpret_cast<const entry&>(table2[get_table2_index(bits)]);
else
e = reinterpret_cast<const entry&>(table3[get_table3_index(bits)]);
r += e.r_offset; // r is 18 bits, top 14 bits are always 0.
int x = e.x; // x is 14 bits, top 18 bits are always 0.
int sign_offset = e.sign_offset;
assert(sign_offset <= 16 && sign_offset > 0);
// The following is the hotspot.
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
(*result++) = ((x << 18) * sign) | r; // 32 bits
// End of hotspot
bit_reader.skip(sign_offset); // sign_offset is the last bit used.
}
Run Code Online (Sandbox Code Playgroud)
虽然我还没有弄清楚如何进一步优化这个,也许是来自Bit-Granularity操作的内在函数,__shiftleft128或者_rot可能有用吗?
请注意,我也正在处理GPU上的结果数据,因此重要的是获取resultGPU然后可用于计算正确数据的内容.
建议?
编辑:
添加了表查找.
编辑:
int sign = 1 - (bits >> (63 - e.sign_offset) & 0x2);
000000013FD6B893 and ecx,1Fh
000000013FD6B896 mov eax,3Fh
000000013FD6B89B sub eax,ecx
000000013FD6B89D movzx ecx,al
000000013FD6B8A0 shr r8,cl
000000013FD6B8A3 and r8d,2
000000013FD6B8A7 mov r14d,1
000000013FD6B8AD sub r14d,r8d
Run Code Online (Sandbox Code Playgroud)
我认为这是最快的解决方案:
*result++ = (_rotl64(bits, sign_offset) << 31) | (x << 18) | (r << 0); // 32 bits
Run Code Online (Sandbox Code Playgroud)
然后根据 GPU 上是否设置了符号位来纠正 x。