在现代x86硬件上编写比特流的最快方法

Question

在现代x86硬件上编写比特流的最快方法

Mag*_*nus 11 c++ optimization x86 bit-manipulation

在x86/x86-64上写入比特流的最快方法是什么？(代码字<= 32位)

通过写一个比特流,我指的是将可变比特长度符号连接成一个连续的内存缓冲区的过程.

目前我有一个带有32位中间缓冲区的标准容器可以写入

void write_bits(SomeContainer<unsigned int>& dst,unsigned int& buffer, unsigned int& bits_left_in_buffer,int codeword, short bits_to_write){
    if(bits_to_write < bits_left_in_buffer){
        buffer|= codeword << (32-bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;

    }else{
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer|(codeword<<(32-bits_left_in_buffer));
        buffer= full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32-full_bits;
    }
}

Run Code Online (Sandbox Code Playgroud)

有没有人知道任何好的优化,快速指令或其他可能有用的信息？

干杯,

Answer 1

rus*_*lik 6

我曾写过一个相当快的实现,但它有一些限制:当你写和读取比特流时,它可以在32位x86上工作.我不在这里检查缓冲区限制,我正在分配更大的缓冲区并不时从调用代码中检查它.

unsigned char* membuff; 
unsigned bit_pos; // current BIT position in the buffer, so it's max size is 512Mb

// input bit buffer: we'll decode the byte address so that it's even, and the DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1;  // rounding down by 2.
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
};

// output buffer, the whole destination should be memset'ed to 0
inline unsigned int put_bits(unsigned int val, unsigned int bit_cnt){
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
    bit_pos += bit_cnt;
};

Run Code Online (Sandbox Code Playgroud)

Answer 2

seh*_*ehe 1

我没有时间为你写（不太确定你的样本实际上是否足够完整）但如果你必须的话，我可以想到

使用各种输入/输出位移位偏移的转换表；这种优化对于固定的位单位是有意义的n（具有n足够大的（8位？）以期望性能提升）本质上，你可以这样做
```
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
```
Run Code Online (Sandbox Code Playgroud)

免责声明：这是非常草率的伪代码，我只是希望它传达了我对查找表的想法，以防止位移算术

用汇编语言编写它（我知道 i386 有XLAT，但话又说回来，一个好的编译器可能已经使用了类似的东西）；另外，XLAT 似乎仅限于 8 位和 AL 寄存器，因此它并不是真正通用

更新

警告：请务必使用分析器并测试优化的正确性和速度。根据引用的局部性，使用查找表可能会导致性能较差。因此，您可能需要更改单个核心上的位流线程（设置线程关联）才能获得好处，并且您可能必须根据处理器的 L2 缓存调整查找表大小。

另外，如果您知道可以使用某些功能，请查看SIMD、SSE4或GPU (CUDA)指令集。

查找表很有趣。根据我的经验，“就地”处理往往比现代机器上的查找表（用于低级操作）更快，所以我不相信。我一定会尝试一下，看看它的表现如何。 (3认同)

归档时间：	14 年，7 月前
查看次数：	4593 次
最近记录：	8 年，11 月前