Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?

NUL*_*ULL 8 c++ sse avx bitset matrix-multiplication

For multiplying large binary matrices (10K x 20K), what I usually do is convert the matrices to float and do float matrix multiplication, since integer matrix multiplication is quite slow (see here).

This time though, I need to perform over a hundred thousand of these multiplications, so even a millisecond of performance improvement on average matters to me.


I want an int or float matrix as the result, because the product may have elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.

In the inner product between a row vector and a column vector (producing one element of the output matrix), multiplication simplifies to bitwise AND. Addition is still addition, but we can add the bits with a population-count function instead of looping over them individually.
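To make this concrete, here is a minimal scalar sketch (the helper name dot_bits is mine, not from any library) of one such inner product over bit-packed 64-bit words:

#include <cstdint>

// Hypothetical helper: inner product of two bit-packed vectors of nwords
// 64-bit words.  AND keeps the positions where both inputs are 1, and
// popcount adds them up.  __builtin_popcountll is GCC/Clang; other compilers
// have equivalents (e.g. _mm_popcnt_u64 with -mpopcnt).
int dot_bits(const uint64_t* a, const uint64_t* b, int nwords) {
    int sum = 0;
    for (int w = 0; w < nwords; ++w)
        sum += __builtin_popcountll(a[w] & b[w]);
    return sum;
}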

Some other boolean/binary-matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.


Below is sample code showing that formulating the problem as std::bitset, AND and count operations is faster than matrix multiplication.

#include <iostream>
using std::cout; using std::endl;
#include <vector>
    using std::vector;
#include <chrono>
#include <Eigen/Dense>
    using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
    using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
    using std::bitset;

#include <cmath>
    using std::floor;

const int NROW = 1000;
const int NCOL = 20000;

const float DENSITY = 0.4;
const float DENOMINATOR = 10.0 - (10*DENSITY);

void fill_random(vector<float>& vec) {
    random_device rd;
    mt19937 eng(rd());
    uniform_int_distribution<> distr(0, 10);
    for (int i = 0; i < NROW*NCOL; ++i)
        vec.push_back(floor(distr(eng)/DENOMINATOR));
}

void matmul(vector<float>& vec){
    float *p = vec.data();
    MatrixXf A = Eigen::Map<Eigen::Matrix<float, NROW, NCOL, Eigen::RowMajor>>(p);
    cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
    cout << "Total non-zero values : " << A.sum() << endl;
    cout << "The density of non-zero values is " <<  A.sum() * 1.0 / (A.cols()*A.rows()) << endl;

    auto start = std::chrono::steady_clock::now();
    MatrixXf B = A.transpose() * A;
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "Mat mul took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "Eigen coo ";
    for (int i=0; i<10; ++i)
        cout << B(0,i) << " ";
    cout << endl;
}


void bitset_op(vector<float>& vec) {
    // yeah it's not a great idea to set size at compile time but have to
    vector<bitset<NROW>> col_major(NCOL);

    // right, multiple par for isn't a good idea, maybe in a parallel block
    // Doing this for simplicity to profile second loop timing 
    // converting row major float vec to col major bool vec
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int i=0; i < NROW; ++i) {
            col_major[j].set(i, vec[i*NCOL + j] && 1);
        }
    }

    auto start = std::chrono::steady_clock::now();
    vector<int> coo;
    coo.assign(NCOL*NCOL, 0);
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int k=0; k<NCOL; ++k) {
            coo[j*NCOL + k] = (col_major[j]&col_major[k]).count();
        }
    }
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "bitset intersection took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "biset coo ";
    for (int i=0; i<10; ++i)
        cout << coo[i] << " ";
    cout << endl;
}


int main() {
    // Saving to float instead of int to speed up matmul
    vector<float> vec;
    fill_random(vec);
    matmul(vec);
    bitset_op(vec);
}

Running this with:

g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code

I get:

Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
biset coo 458 206 208 201 224 205 204 199 217 210

As you can see, the matmul done as a batch of bitset operations is about 3x faster than Eigen's float matmul, which makes sense.

I want to emphasize that I need to perform this operation over 100K times (on an HPC cluster or in the cloud), so even a millisecond improvement on average would make a difference.

I'm not tied to any specific library, C++ standard, etc. So feel free to answer with any solution you think is faster, other than using a GPU, which I can't use for a number of reasons.

Pet*_*des 12

I don't know how much, if any, of this you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is GPL-compatible). There are some non-GPLed wrappers, too.

Cache-blocking a matrix multiply is a fine art (and will matter here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise AND on integers instead of multiplying floats. I'm not sure whether this is possible.

I did some searching, and most of the literature on binary matrices is about producing a boolean result (including SO questions like this one). A vector inner product is done with AND as the multiply, but with XOR or OR as the add, not popcount. Maybe there's a search term I'm missing that describes "normal" matrices that happen to be (0,1) matrices, but where the product won't be.

Since every millisecond counts, you'll probably have to manually vectorize this.


It's not that vector-integer stuff is slow in general; it's just vector-integer multiply that's slow, especially compared to vector float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).

Since you don't need an actual multiply with boolean elements, just an AND (3-vectors-per-clock throughput), that isn't a problem for you. The efficiency gain from handling many more elements per vector should more than make up for any extra cost per vector.

Of course, this assumes an integer matmul implementation that uses all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.

I'm mostly answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes: it's a huge waste of time to use actual multiplication, especially with 32-bit elements.


Storage format options:

One element (0/1) per byte:

  • 4x the density of float (cache footprint / memory bandwidth / elements per vector)
  • easy to transpose with byte shuffles
  • vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop and working on multiple rows or multiple columns at once. Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling.)

4 elements per byte, packed into the low nibble:

  • 4x the density of separate bytes
  • very efficient to popcount with AVX2 vpshufb. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates the SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with vpand.) When bottlenecked on L2 load bandwidth (~one 32B load per cycle), it runs at 64 elements per clock. See below.
  • slower to create from integers or from a packed bitmap (but not bad if you put bits into vectors in an interleaved order for efficient packing/unpacking to in-order bytes, rather than forcing the bits to be in order).
  • hard to transpose (probably worse than fully packed)

Packed bits:

  • 8x the density of separate bytes, 256 elements per AVX2 vector.
  • can be created from vectors with pmovmskb, for a non-interleaved storage order. (Not very useful for creating on the fly, though, since that puts the result in an integer register, not a vector. An interleaved bit order is probably best, especially for unpacking during a transpose.)
  • fairly efficient to popcount with AVX2: mask / shift+mask / 2x vpshufb. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate a popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU-port bottlenecks limit this to 96 elements per clock from L1D or L2. So it has about 1.5x the inner-product throughput of the pack4 strategy when bottlenecked on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is only the inner-product part, not accounting for the different packing/unpacking costs. (See the sketch after this list.)
  • hard to transpose (but maybe not terrible with pmovmskb to extract 1 bit from each byte and make them contiguous).
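As a rough sketch of that mask / shift+mask / 2x vpshufb popcount for fully packed bits (AVX2 assumed; the function name and accumulator convention are mine, not any particular library's API):

#include <immintrin.h>

// Count the set bits of (a & b), one 256b vector of fully packed elements,
// and add the per-byte counts into a byte accumulator.  Each call adds at
// most 8 to any byte, so the caller must widen acc with vpsadbw (see further
// down) at least every 31 calls to avoid byte overflow.
static inline __m256i and_popcount_acc(__m256i a, __m256i b, __m256i acc) {
    const __m256i lut  = _mm256_setr_epi8(   // popcount of each nibble value 0..15
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low4 = _mm256_set1_epi8(0x0f);
    __m256i v   = _mm256_and_si256(a, b);                          // vpand
    __m256i lo  = _mm256_and_si256(v, low4);                       // low nibbles
    __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low4); // high nibbles
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),    // 2x vpshufb
                                  _mm256_shuffle_epi8(lut, hi));
    return _mm256_add_epi8(acc, cnt);                              // vpaddb
}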

6 elements per byte, 0xxx0xxx (probably no advantage for this problem on HSW/SKL, but interesting to consider):

  • 6x the density of separate bytes
  • fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4-bits-per-byte format.
  • optimized for efficient popcount with AVX2 vpshufb. No need to mask before the 2x vpshufb, just 1 right shift. (vpshufb zeroes a byte if the high bit is set, otherwise it uses the low nibble as an index. That's why it otherwise needs the masking.) Right-shifting this format by 4 (vpsrld ymm0, 4) still leaves a zero in the high bit of every byte. Load + AND -> accumulate-popcount is 7 fused-domain uops per vector (vmovdqa / vpand ymm,[mem] / vpsrld ymm,4 / 2x vpshufb / 2x vpaddb), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it runs right into the L2 bandwidth bottleneck.

    In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (e.g. fewer AND/ADD uops stealing port 5 from the shuffles). Fully packed is probably more likely to come close to the theoretical speed, because more of its uops can run on multiple ports, so there's less chance of out-of-order scheduling imperfections.

  • the pmovmskb transpose trick doesn't work cleanly.
  • could be useful if we just need popcount(A[]) rather than popcount(A[] & B[]), or for a different microarchitecture where ALU vs. load throughput differs.

Another variation on this: 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b (_mm512_permutex2var_epi8), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but hopefully it will have better throughput than an AVX512 vpshufb separate-nibble thing.

To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could use vpermi2b to count the low 7 bits, then shift + mask the top bit and add it. (popcount of a single bit = that bit).


uint8_t elements are easier to shuffle around efficiently (since there are byte shuffles like vpshufb), so they may be worth considering if you have to transpose on the fly. Or maybe only pack down to bits on the fly while transposing?

32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).

This is also a big deal for cache footprint/memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier/less important.

10k*20k/8 = 23.84MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you will be running on in HPC/cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer. It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)


Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.

With 8-bit integer 0/1 elements, you can do up to 255 vpaddb before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq, then horizontally sum it.
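A sketch of that accumulation pattern (AVX2; helper names are mine): vpaddb the 0/1 AND results for up to 255 vectors, widen with vpsadbw against zero, accumulate the 64-bit lanes with vpaddq, and horizontally sum at the very end.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Inner product of two rows stored one 0/1 element per byte.
// nvec = number of 32-byte vectors; unaligned loads used for simplicity.
uint64_t dot_bytes(const uint8_t* a, const uint8_t* b, size_t nvec) {
    __m256i total = _mm256_setzero_si256();               // 4x 64-bit partial sums
    size_t i = 0;
    while (i < nvec) {
        __m256i acc = _mm256_setzero_si256();             // byte accumulator
        size_t end = (i + 255 < nvec) ? i + 255 : nvec;   // each byte grows by <= 1 per iter
        for (; i < end; ++i) {
            __m256i va = _mm256_loadu_si256((const __m256i*)(a + 32*i));
            __m256i vb = _mm256_loadu_si256((const __m256i*)(b + 32*i));
            acc = _mm256_add_epi8(acc, _mm256_and_si256(va, vb));   // vpand + vpaddb
        }
        total = _mm256_add_epi64(total,                             // vpaddq
                    _mm256_sad_epu8(acc, _mm256_setzero_si256()));  // vpsadbw
    }
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(total),        // horizontal sum
                              _mm256_extracti128_si256(total, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (uint64_t)_mm_cvtsi128_si64(s);
}

(With unrolling and the multiple accumulators mentioned above this can go faster; the single-accumulator version is just the shortest correct form.)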

With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a VPSHUFB-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb could count the bits in each byte of an AND result, without needing any shifting/masking. Inside the inner loop, you'd have 2 loads, vpand, vpshufb, vpaddb. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with vpsadbw/vpaddq. (But with cache-blocking, you probably want to break out that often anyway and do different work). So the inner-most loop should run at 4 elements per byte*32B per vector = 128 elements per clock cycle, if you can arrange for it to read data that's hot in L1D cache. Expect about half that bandwidth from L2 cache on Haswell/Skylake, or much worse from L3 cache.
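As a sketch of that inner loop for the low-nibble format (AVX2, 32-byte-aligned rows assumed, helper name is mine): every byte of the AND result is a value 0..15, so one vpshufb LUT lookup counts its bits with no shifting or masking, and the loop breaks out every 63 vectors (63 * 4 = 252 <= 255) to widen the byte accumulator with vpsadbw.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Inner product of two rows packed 4 elements per byte in the low nibble.
// nvec = number of 32-byte vectors; a and b assumed 32-byte aligned.
uint64_t dot_nibble_packed(const uint8_t* a, const uint8_t* b, size_t nvec) {
    const __m256i lut = _mm256_setr_epi8(   // popcount of each nibble value 0..15
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m256i total = _mm256_setzero_si256();
    size_t i = 0;
    while (i < nvec) {
        __m256i acc = _mm256_setzero_si256();
        size_t end = (i + 63 < nvec) ? i + 63 : nvec;     // <= 63*4 = 252 per byte
        for (; i < end; ++i) {
            __m256i va = _mm256_load_si256((const __m256i*)(a + 32*i));
            __m256i vb = _mm256_load_si256((const __m256i*)(b + 32*i));
            __m256i v  = _mm256_and_si256(va, vb);                   // vpand (one load can fold in)
            acc = _mm256_add_epi8(acc, _mm256_shuffle_epi8(lut, v)); // vpshufb + vpaddb
        }
        total = _mm256_add_epi64(total,
                    _mm256_sad_epu8(acc, _mm256_setzero_si256()));   // vpsadbw
    }
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(total),
                              _mm256_extracti128_si256(total, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (uint64_t)_mm_cvtsi128_si64(s);
}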

With uint8_t elements that are 0 or 1, you can maybe use some integer multiply-add instructions. They're a bit weirdly designed, intended for different use-cases than FP FMA. They add horizontal pairs of multiply results, producing wider elements. VPMADDUBSW widens from 8 to 16 bit elements, and would work well on 0s and 1s. Since each element can only be in the range 0..2, you can still horizontal-sum with vpsadbw. But if you're going to vpsadbw, this gains you nothing over vpand. It would only be useful if you wanted to use vpaddw to use 16-bit elements in your vector accumulator, instead of breaking out of a loop to avoid byte overflow. vpmaddubsw doesn't seem useful here, because vpsadbw is a better way to horizontally add bytes.


Converting 0/1 integers to bitmaps can be done efficiently with SSE/AVX: For 32-bit integer elements, vpslld ymm0, 31 to left-shift the relevant bit to the top of each element, then vmovmskps eax, ymm0 to get an 8-bit mask of the high byte of each 32-bit element. For 8-bit integer elements, vpslld ymm0, 7/vpmovmskb eax, ymm0 to do the same thing but for each byte, producing a 32-bit integer bitmap result. (Only the sign bit of each byte matters, so it's fine that there are no shift instructions with only 8 bit granularity. You don't need to do anything about the bits that carry into the next element.)
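A sketch of those two conversions (AVX2; helper names are mine). Both rely on the elements being strictly 0 or 1, so the dword shift can't carry a set bit across a byte/element boundary:

#include <immintrin.h>
#include <cstdint>

// 8x 32-bit 0/1 elements -> 8-bit bitmap (vpslld + vmovmskps)
static inline uint8_t bitmap_from_epi32(__m256i v) {
    __m256i top = _mm256_slli_epi32(v, 31);  // move bit 0 of each element to the sign bit
    return (uint8_t)_mm256_movemask_ps(_mm256_castsi256_ps(top));
}

// 32x 8-bit 0/1 elements -> 32-bit bitmap (vpslld + vpmovmskb)
static inline uint32_t bitmap_from_epi8(__m256i v) {
    __m256i top = _mm256_slli_epi32(v, 7);   // bit 0 of each byte -> bit 7 of that byte
    return (uint32_t)_mm256_movemask_epi8(top);
}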

This isn't a very good method for using right away with vectors, because you end up with the results in integer registers. This isn't a great format to generate and use on the fly, but it is the most compact so it can make sense if you can keep matrices in this format long-term. (And if you'll be limited by memory bandwidth when loading them.)

Converting 32-bit integers to 8-bit: One way is with 2x vpackssdw + vpacksswb. Because those operate within the 128b lanes, your elements will end up reordered. But that's ok as long as it's the same ordering for every row/column. It's only a problem if you want to take a chunk of a row/column that doesn't start at a multiple of 32 elements. Another option here is to left-shift (by 8, by 16, and by 24), and OR vectors together. Actually, you can do the shifting for free by using an unaligned load offset by 1, 2, or 3 bytes.

static inline
__m256i load_interleave4x32(const int32_t *input) {
  const char *p = (const char*)input;
  __m256i t0 = _mm256_load_si256((const __m256i*)(p));
  __m256i t1 = _mm256_load_si256((const __m256i*)(p+32*1-1));  // the 1/0 bits will be in the 2nd byte of each 32-bit element
  __m256i t2 = _mm256_load_si256((const __m256i*)(p+32*2-2));
  __m256i t3 = _mm256_load_si256((const __m256i*)(p+32*3-3));
  return t0 | t1 | t2 | t3;
  // or write this out with _mm256_or_si256, if you don't have overloaded operators like GNU C does.
  // this should compile to 1 load and 3 vpor ymm0, [rdi+31] ... instructions.
}

Converting to half-packed 4 bits per byte: we can use the same idea as above. Get 4 vectors from load_interleave4x32 (or from an array of uint8_t if you started with 8-bit elements). Left-shift them by 0, 1, 2, and 3 bits, and OR those all together. This interleaved bit-order is fine when we just need to AND a row/column and popcount the whole result, because order doesn't matter. This bit-order is fairly efficient to unpack back to in-order bytes, e.g. AND with set1_epi8(1) will get you a vector of bytes.
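A sketch of that shift-and-OR packing step (AVX2; helper name is mine), taking four vectors of 0/1 bytes such as four results of load_interleave4x32 above, or four loads from a uint8_t matrix:

#include <immintrin.h>

// Combine four vectors of 0/1 bytes into one vector with 4 elements per byte
// in the low nibble.  The bit order within each nibble is interleaved, which
// is fine because AND + popcount doesn't care about element order.
static inline __m256i pack4_per_byte(__m256i b0, __m256i b1, __m256i b2, __m256i b3) {
    __m256i v01 = _mm256_or_si256(b0, _mm256_slli_epi32(b1, 1));
    __m256i v23 = _mm256_or_si256(_mm256_slli_epi32(b2, 2),
                                  _mm256_slli_epi32(b3, 3));
    return _mm256_or_si256(v01, v23);
}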

You might use this as part of a transpose if you store your whole matrices in this format, or you could use this format to store temporary copies for a cache-blocked transpose. A matmul touches each row/column multiple times, so it may be worth doing extra work to make a compact format the first time when that lets you do 4x as much work per vector on subsequent passes.


With AVX512BW (Skylake-AVX512)

We really want to be doing the AND and popcnt with vectors, not with scalar integer, because the vectors are twice as wide as AVX2, so they pull further ahead of scalar popcnt. (Even though Skylake-AVX512 shuts down the vector ALUs (but not scalar) on port 1 while running 512b instructions).

@Harold points out an interesting identity that lets us do 2/3rds as many vector popcounts, at the cost of extra integer operations.

   popcnt(a) + popcnt(b) + popcnt(c)
 = popcnt(a ^ b ^ c) + 2 * popcnt((a ^ b) & c | (a & b))

a ^ b ^ c and (a ^ b) & c | (a & b) can be done with one vpternlogd each (since each has 3 boolean inputs). The 2* is free if we use a separate pre-shifted vpshufb LUT vector. See also this implementation that uses 30x vpternlogd + 1 vector popcnt to handle 16 vectors of 512b, with some cleanup at the end (only doing the 16*popcnt counts inside the loop; everything else is chained).
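For illustration, the carry-save step behind that identity could look like this with AVX512F intrinsics (the helper name is mine; 0x96 is the truth table for 3-input XOR, 0xE8 for majority):

#include <immintrin.h>

// One carry-save-adder step: popcnt(a) + popcnt(b) + popcnt(c)
//   == popcount(sum) + 2 * popcount(carry),
// so three inputs need only two vector popcounts (or effectively fewer, if
// the carry counts are folded in with a pre-shifted LUT as described above).
static inline void csa512(__m512i a, __m512i b, __m512i c,
                          __m512i* sum, __m512i* carry) {
    *sum   = _mm512_ternarylogic_epi64(a, b, c, 0x96);  // a ^ b ^ c
    *carry = _mm512_ternarylogic_epi64(a, b, c, 0xE8);  // (a & b) | ((a ^ b) & c)  (majority)
}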

This is very likely worth it for counting fully-packed 8 bits per byte elements, and makes that format a lot more attractive for AVX512, compared to less-dense formats optimized for popcounting without as much shifting/masking.

vpternlogd can also be useful as a bit-blend instruction for transposes, if byte-granularity VPBLENDMB zmm{k1}, zmm, zmm isn't fine-enough grained.

This might be worth it for AVX2 on some CPUs, maybe avoiding 1 out of every 4 or 5 vector popcounts rather than 1 of 3? Or it might not help at all if it just increases the total execution port pressure, and there wasn't a bottleneck on any specific one. It would be useful with scalar popcnt instructions (maybe on CPUs without AVX2), because those do bottleneck on a single port on Intel CPUs.


We can turn uint8_t boolean elements into non-interleaved bitmaps slightly more efficiently than AVX2 (without even needing a shift), and do the reverse much more efficiently. Test-into-mask or compare-into-mask against a vector of set1_epi8(1) would both do the job, producing 64 bits of mask from 64 bytes of input. Or with 32-bit integers to start with, producing 16 bits of mask at a time. You can efficiently concatenate those bits with kunpck instructions.

_mm512_test_epi8_mask (vptestmb) is interesting: AND two vectors together, and produce a mask-register result of byte-elements that were true/false. But this isn't really what we want: if we're going to pack our bits, we want to do it as a pre-processing step on the input matrices, not on the fly while doing inner products.

bitmap -> vector of 0/-1 is fast: __m512i _mm512_movm_epi8 (__mmask64 k) (vpmovm2b) does that in one instruction. You can subtract -1 instead of adding 1, but you'd have to mask it before you can OR together multiple bits within a byte.
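Minimal sketches of those two AVX512BW conversions (helper names are mine):

#include <immintrin.h>

// 64 bytes of 0/1 -> 64-bit bitmap: vptestmb against set1_epi8(1).
static inline __mmask64 bytes_to_mask(__m512i v) {
    return _mm512_test_epi8_mask(v, _mm512_set1_epi8(1));
}

// 64-bit bitmap -> bytes: vpmovm2b gives 0/-1 per byte; AND with 1 for 0/1 bytes.
static inline __m512i mask_to_bytes01(__mmask64 m) {
    return _mm512_and_si512(_mm512_movm_epi8(m), _mm512_set1_epi8(1));
}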

Without AVX512BW or AVX512DQ (Knight's Landing Xeon Phi), you don't have 512b vpshufb so you can't vector popcnt as efficiently. There's an AVX512 popcnt extension for vector popcnt directly, but no hardware with it has even been announced yet. (AVX2 vpshufb ymm is very slow on KNL, though: one per 12 cycles, and psadbw ymm is 1 per 9 cycles, so even using 256b vectors is unattractive). You might use a bithack popcnt based on 32-bit integer elements, since that's just AND/shift/ADD. 32-bit elements will take fewer steps to popcnt than 64-bit, but are still big enough not to overflow for reasonable problem sizes (so you can defer a horizontal sum of the vector until outside a loop)
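A sketch of that bithack popcount on 32-bit lanes using only AVX512F intrinsics (helper name is mine), as one option for KNL:

#include <immintrin.h>

// Classic shift/AND/ADD popcount, per 32-bit lane, no vpshufb needed.
// The result (0..32 per lane) can be accumulated with vpaddd for a long
// time before a horizontal sum is required.
static inline __m512i popcount_epi32_swar(__m512i v) {
    const __m512i m1 = _mm512_set1_epi32(0x55555555);
    const __m512i m2 = _mm512_set1_epi32(0x33333333);
    const __m512i m4 = _mm512_set1_epi32(0x0F0F0F0F);
    v = _mm512_sub_epi32(v, _mm512_and_si512(_mm512_srli_epi32(v, 1), m1));
    v = _mm512_add_epi32(_mm512_and_si512(v, m2),
                         _mm512_and_si512(_mm512_srli_epi32(v, 2), m2));
    v = _mm512_and_si512(_mm512_add_epi32(v, _mm512_srli_epi32(v, 4)), m4);
    // each byte now holds its own count (0..8); sum the 4 bytes of each lane
    return _mm512_srli_epi32(_mm512_mullo_epi32(v, _mm512_set1_epi32(0x01010101)), 24);
}

(The final multiply-and-shift could also be deferred by accumulating the per-byte counts with vpaddd for up to 31 vectors first, since each byte only gains up to 8 per vector.)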

Given the choice of storage format, packing multiple bits per byte might not be a good idea for KNL, but single-byte integer elements are good. vpandd zmm and vpaddd zmm are both fast and part of AVX512F, and we can use them because we don't want to let our single-bytes overflow anyway. (Using a packed 32-bit add when we actually have 8-bit elements that won't carry into each other is a SWAR technique.) KNL has good memory bandwidth and poor instruction throughput relative to Skylake-AVX512, I think.


Transposing bits:

BMI2 _pdep_u64 might be useful here. It's a scalar instruction/intrinsic. If it makes bit-transpose a lot more efficient than unpacking to bytes, you'd probably want to store a block of transpose results before reloading it with vector loads for AND + count. (Reloading a vector right away after scalar stores will cause a store-forwarding stall.)
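As one concrete possibility (a sketch, not necessarily the best approach; helper name is mine): the companion BMI2 instruction _pext_u64 can transpose an 8x8 bit tile held in a uint64_t with one row per byte, which could serve as a building block for a blocked bit-transpose. Needs -mbmi2.

#include <immintrin.h>
#include <cstdint>

// Transpose an 8x8 bit tile: byte i of the input is row i.
// The mask 0x0101...01 << j selects bit j of every row byte, and pext packs
// those 8 column bits into the low byte of the result, which becomes row j
// of the transposed tile.
static inline uint64_t transpose8x8_bits(uint64_t rows) {
    uint64_t t = 0;
    for (int j = 0; j < 8; ++j)
        t |= _pext_u64(rows, 0x0101010101010101ULL << j) << (8 * j);
    return t;
}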

Another useful option is that vpmovmskb can slice 32 bits out of a 32-byte vector, one per byte. This gives you a building block for a transpose, maybe combined with byte shuffles to get the bytes in the right order for it. For more, see this blog post, and also How would you transpose a binary matrix?.


Using this in a matmul

Some of your choices depend on what format your input data is in, and how often you will reuse the same matrices. If a matrix will be used multiple times, packing it down to 4 or 8 bits per byte ahead of time makes sense. (Or on the fly the first time it's used). Keeping a transposed copy of it may also make sense, especially if it will always be the side of the multiply that needs transposing. (If you sometimes need one way and sometimes the other, redoing on the fly might be better for L3 cache footprint. But these are big enough that you probably won't get a lot of L3 hits, so just keeping a transposed copy could be good.)

Or maybe even write out a transposed and non-transposed version while converting from your input format.

You will definitely want to cache-block the multiplies, so the same data is reused multiple times while hot in L1. I don't have anything useful to say about this off the top of my head. The same principles apply as when cache-blocking a normal FP matmul, so go read about that.


Comments on your C++ implementation:

Using a bitset & for a whole column will put the values back in memory, and then you'll loop over them again in .count() on the result. I doubt that the compiler will optimize this into a one-pass loop that uses a VPSHUFB-based bit-slicing popcnt on each vector of VPAND results, but that would be much better. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

With your matrix sizes, at least that inner loop probably hits in L1D cache, but the extra load/store instructions from looping twice are more overhead and it also interferes with prefetch of the valuable data.


Getting compilers to efficiently popcnt a dynamically-sized bitmap (without manually vectorizing)

