使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置

Question

使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置

pkt*_*der 13 c optimization x86 x86-64 simd

（相关：如何在Sandy Bridge上的一系列int中快速将位计数到单独的bin中？是对此的早期复制，带有一些不同的答案。编者注：这里的答案可能更好。

同样，是类似问题的AVX2版本，整行位的许多bin比一个宽得多uint64_t：改进列填充计数算法）

我正在C中的一个项目中，我需要经历数千万个掩码（ulong类型（64位）），并target基于一个简单规则更新64个短整数（uint16）的数组（称为）：

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++
    }
}

Run Code Online (Sandbox Code Playgroud)

问题是我需要在数以百万计的蒙版上执行上述循环，并且我需要在不到一秒钟的时间内完成。想知道是否有任何方法可以加快它的速度，例如使用某种表示上述循环的特殊汇编指令。

目前，我在ubuntu 14.04（i7-2670QM，支持AVX，而不是AVX2）上使用gcc 4.8.4来编译和运行以下代码，大约需要2秒钟。很想让它在200ms以下运行。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}

Run Code Online (Sandbox Code Playgroud)

提前致谢！

Answer 1

chq*_*lie 13

在我的系统上，使用4年的MacBook（2.7 GHz Intel Core i5）clang-900.0.39.2 -O3，您的代码在500毫秒内运行。

只需将内部测试更改为if ((pLong[j] & m) != 0)节省30％的时间即可运行350ms。

进一步简化内部，target[i] += (pLong[j] >> i) & 1;无需测试即可将其降至280ms。

进一步的改进似乎需要更高级的技术，例如将钻头拆成8个ulong的块并将其并行添加，一次处理255 ulong。

这是使用此方法的改进版本。它在我的系统上运行45毫秒。

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}

Run Code Online (Sandbox Code Playgroud)

在答案中调查并解释了将字节扩展为64位长的技术：https : //stackoverflow.com/a/55059914/4593267。我将target数组和数组都设为局部变量，inflate然后打印结果以确保编译器不会优化计算。在生产版本中，您将inflate单独计算数组。

直接使用SIMD可能会提供进一步的改进，但会降低可移植性和可读性。这种优化通常最好交给编译器，因为它可以为目标体系结构生成特定的代码。除非性能至关重要，并且基准测试证明这是一个瓶颈，否则我将始终支持通用解决方案。

njuffa的另一种解决方案无需预先计算的阵列即可提供相似的性能。根据您的编译器和硬件的具体情况，它可能会更快。

@pktCoder：还请注意，chqrlie使用的是clang，通常会展开内部循环，而gcc不会启用循环展开，除非您使用`-fprofile-generate` /运行程序/`-fprofile-use`。另外，您的gcc4.8很旧，几乎比您的CPU更新。较新的gcc版本会更好地进行优化。 (3认同)

Answer 2

Pet*_*des 11

相关：较早的副本有一些替代的想法：如何在Sandy Bridge上的一系列int中快速将位计入单独的容器中？。哈罗德在每个位列上关于AVX2列填充计数算法的答案

另外：https : //github.com/mklarqvist/positional-popcount具有SSE混合，各种AVX2，各种AVX512（包括适用于大型阵列的Harley-Seal）以及各种其他用于位置popcount的算法。 可能仅适用于uint16_t，但大多数都可以适用于其他字宽。我认为我在下面提出的算法就是他们所说的adder_forest。

最好的选择是SIMD，在Sandybridge CPU上使用AVX1。编译器不够聪明，无法为您自动向量化循环位，即使您无分支地编写它也可以提供更好的机会。

不幸的是，它还不够智能，无法自动矢量化逐渐扩展和增加的快速版本。

看到intel avx2中的movemask指令是否有反指令？有关位图的摘要->不同大小的矢量解压缩方法。Ext3h在另一个答案中的建议是好的：将位解压缩为比最终计数数组更窄的位，可以为每个指令提供更多元素。使用SIMD可以提高字节效率，然后您最多可以进行255个垂直操作paddb而不会溢出，然后再解压缩以累加到32位计数器阵列中。

它只需要4个16字节__m128i向量即可容纳所有64个uint8_t元素，因此这些累加器可以保留在寄存器中，仅在扩展到外部循环中的32位计数器时才添加到内存中。

解压缩不一定要按顺序进行：target[]在累积所有结果之后，您始终可以在最后将随机播放一次。

可以展开内部循环以从64位或128位向量加载开始，并使用pshufb（_mm_shuffle_epi8）解压缩4或8种不同的方式。

更好的策略是逐步扩大

从2位累加器开始，然后进行屏蔽/移位以将其扩大到4位。因此，在最内层的循环中，大多数操作都在处理“密集”数据，而不是立即对其进行过多“稀释”。更高的信息/熵密度意味着每条指令都可以做更多有用的工作。

使用SWAR技术在标量或SIMD寄存器中进行32x 2位加法运算很容易/很便宜，因为我们始终需要避免执行元素顶部的可能性。如果使用正确的SIMD，我们将丢失这些计数，而使用SWAR，我们将破坏下一个元素。

uint64_t x = *(input++);        // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;  // 0b...01010101;

uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits;            // or use ANDN before shifting to avoid a MOV copy

accum2_lo += lo;   // can do up to 3 iterations of this without overflow
accum2_hi += hi;   // because a 2-bit integer overflows at 4

Run Code Online (Sandbox Code Playgroud)

然后，您最多重复4个4位元素的向量，然后重复8个8位元素的向量，然后应该将宽度扩展到32，并累加到内存中的数组中，因为无论如何都会耗尽寄存器，这外部外部循环工作很少，因此我们不必费心去使用16位。（特别是如果我们手动向量化）。

最大的缺点：与@njuffa的版本不同，这不会自动向量化。 但是gcc -O3 -march=sandybridge对于AVX1（然后在Skylake上运行代码）而言，此运行的标量64位实际上仍然比@njuffa的代码中的128位AVX自动矢量化asm 稍快。

但这是在Skylake上的时机，Skylake具有4个标量ALU端口（以及mov-消除），而Sandybridge缺少mov-消除，只有3个ALU端口，因此标量代码可能会遇到后端执行端口瓶颈。（但是SIMD代码可能几乎一样快，因为在移位中混合了很多AND / ADD，并且SnB的所有3个端口上确实都有SIMD执行单元，上面有任何ALU。Haswell只是为标量添加了端口6， -仅包括班次和分支。）

通过良好的手动矢量化，这应该快将近2或4倍。

但是，如果必须在此标量或具有AVX2自动矢量化功能的@njuffa之间进行选择，则在Skylake上使用@njuffa可以更快 -march=native

如果有可能/需要在32位目标上进行构建，这会遭受很多（由于在32位寄存器中使用uint64_t而没有进行矢量化），而矢量化代码几乎不会受到任何影响（因为所有工作都发生在相同的向量reg中）宽度）。

// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i

void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
    const uint64_t *endp = input + length - 3*4*21;     // 252 masks per outer iteration
    while(input <= endp) {
        uint64_t accum8[8] = {0};     // 8-bit accumulators
        for (int k=0 ; k<21 ; k++) {
            uint64_t accum4[4] = {0};  // 4-bit accumulators can hold counts up to 15.  We use 4*3=12
            for(int j=0 ; j<4 ; j++){
                uint64_t accum2_lo=0, accum2_hi=0;
                for(int i=0 ; i<3 ; i++) {  // the compiler should fully unroll this
                    uint64_t x = *input++;    // load a new bitmask
                    const uint64_t even_1bits = 0x5555555555555555;
                    uint64_t lo = x & even_1bits; // 0b...01010101;
                    uint64_t hi = (x>>1) & even_1bits;  // or use ANDN before shifting to avoid a MOV copy
                    accum2_lo += lo;
                    accum2_hi += hi;   // can do up to 3 iterations of this without overflow
                }

                const uint64_t even_2bits = 0x3333333333333333;
                accum4[0] +=  accum2_lo       & even_2bits;  // 0b...001100110011;   // same constant 4 times, because we shift *first*
                accum4[1] += (accum2_lo >> 2) & even_2bits;
                accum4[2] +=  accum2_hi       & even_2bits;
                accum4[3] += (accum2_hi >> 2) & even_2bits;
            }
            for (int i = 0 ; i<4 ; i++) {
                accum8[i*2 + 0] +=   accum4[i] & 0x0f0f0f0f0f0f0f0f;
                accum8[i*2 + 1] +=  (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
            }
        }

        // char* can safely alias anything.
        unsigned char *narrow = (uint8_t*) accum8;
        for (int i=0 ; i<64 ; i++){
            target[i] += narrow[i];
        }
    }
    /* target[0] = bit 0
     * target[1] = bit 8
     * ...
     * target[8] = bit 1
     * target[9] = bit 9
     * ...
     */
    // TODO: 8x8 transpose
}

Run Code Online (Sandbox Code Playgroud)

我们不在乎顺序，accum4[0]例如，每4位就有4位累加器。 最后需要（但尚未实现）的最终uint32_t target[64]修复是该阵列的8x8转置，使用unpck且vshufps仅使用AVX1即可有效完成。（使用AVX / AVX2转置8x8浮点数）。还有最后一个251个蒙版的清理循环。

我们可以使用任何SIMD元素宽度来实现这些转换。无论如何，我们都必须屏蔽宽度小于16位的宽度（SSE / AVX没有字节粒度的移位，最小仅为16位。）

添加了@njuffa的测试工具在Arch Linux i7-6700k上的基准测试结果。（Godbolt） N = (10000000 / (3*4*21) * 3*4*21) = 9999864（即10000000向下舍入为252个迭代“展开”因子的倍数，因此我的简化实现是在完成相同的工作量，不计算target[]不执行的重新排序，因此它会打印不匹配的结果。但是打印的计数与参考数组的另一个位置匹配。）

我连续运行了4次程序（以确保CPU已预热至最大Turbo），并进行了一次看起来不错的运行（3倍异常高的运行）。

ref：最佳位循环（下一节）
快速：@njuffa的代码。（使用128位AVX整数指令自动向量化）。
渐进式的：我的版本（不是由gcc或clang自动矢量化的，至少在内部循环中没有。）gcc和clang完全展开了内部的12个迭代。

gcc8.2 -O3 -march=sandybridge -fpie -no-pie
参考：0.331373秒，快速：0.011387秒，渐进：0.009966秒

gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
参考：0.397175秒，快速：0.011255秒，渐进：0.010018秒

clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref：0.352381 secs，fast：0.011926 secs，渐进式：0.009269 secs（端口7的计数非常低，clang使用存储的索引寻址）

clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref：0.293014 secs，fast：0.011777 secs，渐进式：0.009235 secs

-march = skylake（允许AVX2使用256位整数向量）对两者都有帮助，但是@njuffa最有用，因为它有更多的向量化了（包括最里面的循环）：

gcc8.2 -O3 -march=skylake -fpie -no-pie
参考：0.328725秒，快速：0.007621秒，渐进：0.010054秒（gcc显示“渐进”没有增益，只有“快”）

gcc8.2 -O3 -march=skylake -fno-pie -no-pie
参考：0.333922秒，快速：0.007620秒，渐进：0.009866秒

clang7.0 -O3 -march=skylake -fpie -no-pie
参考：0.260616秒，快速：0.007521秒，渐进：0.008535秒（IDK为什么渐进比-march = sandybridge快；它没有使用BMI1 andn。我想是因为它为k = 0..20外循环使用了256位AVX2。与vpaddq）

clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref：0.259159 secs，fast：0.007496 secs，渐进式：0.008671 secs

如果没有AVX，只是SSE4.2：（），-march=nehalem奇怪的是铛渐进比使用AVX /调=的Sandy Bridge更快。“快速”仅比使用AVX慢。

gcc8.2 -O3 -march=skylake -fno-pie -no-pie
参考：0.337178秒，快速：0.011983秒，渐进：0.010587秒

clang7.0 -O3 -march=skylake -fno-pie -no-pie
参考：0.293555秒，快速：0.012549秒，渐进：0.008697秒

-fprofile-generate/ -fprofile-use为GCC提供一些帮助，尤其是对于“ ref”版本，默认情况下它根本不会展开。

我强调了最好的，但通常它们彼此在测量噪声容限之内。它是不足为奇的-fno-pie -no-pie，有时快：索引静态阵列，[disp32 + reg]是不是变址寻址模式，只是基地+ disp32对比，所以也没有永远unlaminate上的SandyBridge-系列CPU。

但是用gcc有时会-fpie更快。我没有检查，但我认为gcc只是在可能进行32位绝对寻址时以某种方式使自己陷入困境。或者只是代码天真无邪的差异导致对齐或uop-cache问题；我没有详细检查。

对于SIMD，我们可以简单地uint64_t并行执行2或4x ，仅在将字节扩展为32位元素的最后一步中水平累加。 （也许通过改组in-lane，然后pmaddubsw与的乘数_mm256_set1_epi8(1)一起使用将水平字节对添加到16位元素中。）

TODO：手动矢量化__m128i和__m256i（和__m512i）版本。应该比上述“渐进”时间快2倍，4倍甚至8倍。 硬件预取可能仍然可以跟上它，除非AVX512版本的数据来自DRAM，尤其是在其他线程争用的情况下。每个阅读的qword我们需要做大量的工作。

过时的代码：对位循环的改进

您的便携式标量版本也可以得到改进，通过在3.9GHz Skylake上进行适当的随机输入，可以将其从〜1.92秒（整体分支错误预测率达到34％，并注释掉了快速循环！）提高到〜0.35sec（clang7.0 -O3 -march=sandybridge）。。对于分支版本!= 0而不是，则为1.83秒== m，因为编译器无法证明m总是设置了恰好1位和/或进行了相应的优化。

（与@njuffa或我上面的快速版本相比为0.01秒，因此从绝对意义上来说这是毫无用处的，但是值得一提的是何时使用无分支代码的一般优化示例。）

如果您期望零和一的随机混合，那么您需要一种不会分支的错误分支。这样做+= 0对于为零避免了，并且也意味着，在C抽象机绝对触摸该存储器不管数据的元素。

不允许编译器进行写操作，因此，如果他们想自动对您的if() target[i]++版本进行矢量化处理，则必须使用x86之类的带掩码存储，vmaskmovps以避免非原子读取/重写未修改的元素target。因此，某些可以自动对普通标量代码进行矢量化的假想的未来编译器，将会拥有更轻松的时间。

无论如何，一种写方法是target[i] += (pLong[j] & m != 0);使用bool-> int转换来获取0/1整数。

但是，如果我们仅移动数据并使用隔离低位，则对于x86（可能对于大多数其他体系结构），我们可以获得更好的asm&1。编译器有点笨，似乎看不到这种优化。他们这样做很好的优化掉多余的循环计数器，并转m <<= 1成add same,same有效左移，但他们仍然使用XOR零/ test/ setne创建一个0/1的整数。

像这样的内部循环编译效率更高（但仍然比我们使用SSE2或AVX甚至使用@chrqlie的查找表的标量要差得多，当像这样反复使用时，它将在L1d中保持高温，从而允许SWAR输入uint64_t）：

for (int j = 0; j < 10000000; j++) { #if 1 // extract low bit directly unsigned long long tmp = pLong[j]; for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data target[i] += tmp&1; tmp >>= 1; } #else // bool -> int shifting a mask unsigned long m = 1; for (i = 0; i < 64; i++) { target[i]+= (pLong[j] & m) != 0; m = (m << 1); } #endif
Run Code Online (Sandbox Code Playgroud)
请注意，这unsigned long不能保证是64位类型，也不在x86-64 System V x32（64位模式下为ILP32）和Windows x64中。或在i386 System V等32位ABI中。

编译通过GCC，铛，和ICC的Godbolt编译探险家，这是在用gcc环路1个更少的微指令。但是它们全都是纯标量，c和ICC展开2。

# clang7.0 -O3 -march=sandybridge .LBB1_2: # =>This Loop Header: Depth=1 # outer loop loads a uint64 from the src mov rdx, qword ptr [r14 + 8*rbx] mov rsi, -256 .LBB1_3: # Parent Loop BB1_2 Depth=1 # do { mov edi,

Answer 3

nju*_*ffa 9

即使没有AVX，也可以显着提高速度的一种方法是将数据拆分为最多255个元素的块，并在普通uint64_t变量中按字节累加位计数。由于源数据具有64位，因此我们需要一个由8个字节累加器组成的数组。第一累加器对位置0、8、16，...，56的位进行计数，第二累加器对位置1、9、17，...，57的位进行计数；等等。在完成对数据块的处理之后，我们将计数从字节累加器传输到target计数中。target可以根据上面的描述以一种简单的方式对用于更新最多255个数字的块的计数的功能进行编码，其中BITS是源数据中的位数：

/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

下面显示了整个ISO-C99程序，该程序应至少可以在Windows和Linux平台上运行。它使用PRNG初始化源数据，针对asker的参考实现执行正确性检查，并对参考代码和加速版本进行基准测试。在我的计算机上（3.50 GHz下的Intel Xeon E3-1270 v2），以MSVS 2010进行了完全优化（/Ox）编译后，该程序的输出为：

p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs

Run Code Online (Sandbox Code Playgroud)

其中ref指询问者的原始解决方案。这里的速度提高了约74倍。使用其他（尤其是较新的）编译器会观察到不同的提速。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

/*
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#define N          (10000000)
#define BITS       (64)
#define BLOCK_SIZE (255)

/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

int main (void) 
{
    double start_ref, stop_ref, start, stop;
    uint64_t *pLong;
    unsigned int target_ref [BITS] = {0};
    unsigned int target [BITS] = {0};
    int i, j;

    pLong = malloc (sizeof(pLong[0]) * N);
    if (!pLong) {
        printf("failed to allocate\n");
        return EXIT_FAILURE;
    }
    printf("p=%p\n", pLong);

    /* init data */
    for (j = 0; j < N; j++) {
        pLong[j] = KISS64;
    }

    /* count bits slowly */
    start_ref = second();
    for (j = 0; j < N; j++) {
        uint64_t m = 1;
        for (i = 0; i < BITS; i++) {
            if ((pLong[j] & m) == m) {
                target_ref[i]++;
            }
            m = (m << 1);
        }
    }
    stop_ref = second();

    /* count bits fast */
    start = second();
    for (j = 0; j < N / BLOCK_SIZE; j++) {
        sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
    }
    sum_block (pLong, target, j * BLOCK_SIZE, N);
    stop = second();

    /* check whether result is correct */
    for (i = 0; i < BITS; i++) {
        if (target[i] != target_ref[i]) {
            printf ("error @ %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
        }
    }

    /* print benchmark results */
    printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
    return EXIT_SUCCESS;
}

Run Code Online (Sandbox Code Playgroud)

Answer 4

Ext*_*t3h 7

For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.

So just follow the following strategy for unpacking the bits into bytes of a vector: /sf/answers/1696988751/

Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.

After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).

At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.

typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;

static inline __m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);    // we only use the low 32bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    const __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);

    const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: byte == 0 ? 0 : -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}

void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
    {
        __m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
        for (size_t inner = 0; inner < block_size; ++inner) {
            for (size_t j = 0; j < bits / 32; j++)
            {
                const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
                temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
            }
        }
        for (size_t j = 0; j < bits; j++)
        {
            accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
        }
    }
    for (size_t outer = count - (count % block_size); outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.

On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.

Surprisingly, this is actually still 50% slower than the solution proposed by @njuffa.

归档时间：	7 年，3 月前
查看次数：	635 次
最近记录：	6 年，7 月前