I'm implementing an ultra fast popcount on Intel Xeon® Phi®, as it's a performance hotspot of various bioinformatics software.
I've implemented five pieces of code,
#if defined(__MIC__)
#include <zmmintrin.h>
__attribute__((align(64))) static const uint32_t POPCOUNT_4bit[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
__attribute__((align(64))) static const uint32_t MASK_4bit[16] = {0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF, 0xF};
inline uint64_t vpu_popcount1(uint64_t* buf, size_t n) { …Run Code Online (Sandbox Code Playgroud)