How to speed up Metal code on iOS/Mac OS

Eps*_*lon 7 macos ios metal

I'm trying to implement code in Metal that performs a 1D convolution between two vectors. I've implemented the following, which works correctly:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}

My problem is that processing the same data with Metal (compute time plus data transfer to and from the GPU) takes about 10 times as long as it does in Swift on the CPU. My question is: how can I replace the inner loop with a single vector operation, or is there another way to speed up the code above?
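
For reference, a plain Swift CPU version of the same convolution might look like the sketch below; this is only an illustration of the kind of loop being compared against, not necessarily the exact code that was timed.

// Straight-line CPU convolution in Swift, for reference only.
// A sketch of the baseline being compared against; not the exact timed code.
func convolveCPU(data: [Float], filter: [Float]) -> [Float] {
    let outputSize = data.count - filter.count + 1
    var out = [Float](repeating: 0, count: outputSize)
    for i in 0..<outputSize {
        var sum: Float = 0
        for j in 0..<filter.count {
            sum += data[i + j] * filter[j]
        }
        out[i] = sum
    }
    return out
}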

war*_*enm 12

The key to exploiting GPU parallelism in this case is to let the GPU manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we invoke it once for each element of the data vector. The kernel function simplifies to this:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}

To dispatch this work, we pick a threadgroup size based on the thread execution width suggested by the compute pipeline state. One tricky thing here is making sure there is enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cost us a small amount of wasted memory and computation, but it spares us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.

// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
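
For concreteness, one way to size those padded allocations is sketched below. Here `paddedDataCount` and `paddedOutputCount` are names of my choosing, and `device`, `computePipeline`, `dataCount`, and `filterCount` are assumed to be the same objects used in the code above.

let outputCount = dataCount - filterCount + 1
let width = computePipeline.threadExecutionWidth

// Total threads dispatched is outputCount rounded up to a multiple of the
// thread execution width; the output buffer must hold at least that many floats.
let paddedOutputCount = ((outputCount + width - 1) / width) * width

// The highest-numbered thread reads dataVector[(paddedOutputCount - 1) + (filterCount - 1)],
// so the data buffer needs at least paddedOutputCount + filterCount - 1 elements.
let paddedDataCount = paddedOutputCount + filterCount - 1

let dataBuffer = device.makeBuffer(length: paddedDataCount * MemoryLayout<Float>.stride,
                                   options: .storageModeShared)
let outBuffer = device.makeBuffer(length: paddedOutputCount * MemoryLayout<Float>.stride,
                                  options: .storageModeShared)

// Copy the real input into the first dataCount elements of dataBuffer, and read
// back only the first outputCount elements of outBuffer after execution.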

In my experiments, this parallelized approach runs somewhere between 400 and 1000 times faster than the serial version in the question. I'd be curious to hear how it compares to your CPU implementation.
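
As a side note, on newer OS versions the Metal API spells these calls slightly differently (`index:` instead of `at:`, and the encoder is optional), and on GPUs that support non-uniform threadgroup sizes you can avoid the padding entirely by dispatching an exact number of threads. The sketch below shows that variant; it assumes the same `commandBuffer`, `computePipeline`, buffers, and counts as above, and passes the sizes as `Int32` to match the kernel's 32-bit `int` parameters.

var dataSize = Int32(dataCount)
var filterSize = Int32(filterCount)
let iterationCount = dataCount - filterCount + 1

guard let encoder = commandBuffer.makeComputeCommandEncoder() else { fatalError("Failed to create encoder") }
encoder.setComputePipelineState(computePipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataSize, length: MemoryLayout<Int32>.stride, index: 1)
encoder.setBuffer(filterBuffer, offset: 0, index: 2)
encoder.setBytes(&filterSize, length: MemoryLayout<Int32>.stride, index: 3)
encoder.setBuffer(outBuffer, offset: 0, index: 4)

// dispatchThreads launches exactly iterationCount threads, so no threads run
// past the end of the output and no buffer padding is required.
let threadsPerGrid = MTLSize(width: iterationCount, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth),
                                    height: 1, depth: 1)
encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()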