Related questions and solutions (0)

Reduce matrix rows with CUDA

Windows 7, NVidia GeForce 425M.


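I wrote a simple CUDA program to compute the row sums of a matrix. The matrix is stored as a one-dimensional array (a pointer to float).

The serial version of the code is below (it has 2 loops, as expected):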

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}


In the CUDA code, I call a kernel function that sweeps the matrix by rows. Below is the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
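The launch snippet above rounds the block count up with a floating-point `ceil`. Converting `nrow` to `float` is exact only up to 2^24, so an integer-only ceiling division is a common alternative. The following is a minimal host-side sketch, not the asker's code; the helper name `blocks_for_rows` is hypothetical. Because the grid may then contain more threads than rows, the kernel is assumed to guard with a check such as `row < nrow`.

```c
#include <assert.h>

/* Hypothetical helper: smallest block count such that
 * blocks * threads_per_block >= nrow (integer ceiling division).
 * Assumes nrow + threads_per_block - 1 does not overflow unsigned int. */
unsigned int blocks_for_rows(unsigned int nrow, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return (nrow + threads_per_block - 1) / threads_per_block;
}
```

With this helper, the grid dimension could be built as `dim3 blocksPerGrid(blocks_for_rows(nrow, nThreadsPerBlock));`, assuming both values are non-negative.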


and the kernel function that performs the parallel sum of the rows (still has 1 …

c cuda matrix

MSa*_*ich

2015 04-09

13
Score

2
Answers

10k
Views

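How to normalize matrix columns in CUDA with max performance?

How can I efficiently normalize the columns of a matrix in CUDA?

My matrices are stored in column-major order, and the typical size is 2000x200.

The operation can be expressed with the following MATLAB code.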

A = rand(2000,200);

A = exp(A);
A = A./repmat(sum(A,1), [size(A,1) 1]);


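Can this be done efficiently with Thrust, cuBLAS and/or cuNPP?

A quick implementation that uses 4 kernels is shown below.

I am wondering whether this can be done in 1 or 2 kernels to improve performance, especially for the column summation step implemented with cublasDgemv().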

#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/transform.h>
#include <thrust/iterator/constant_iterator.h>
#include <math.h>

struct Exp
{
    __host__ __device__ void operator()(double& x)
    {
        x = exp(x);
    }
};

struct Inv
{
    __host__ __device__ void operator()(double& x)
    {
        x = (double) 1.0 / x;
    }
};

int main()
{
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_DEFAULT);

    const size_t m = 2000, …


performance cuda matrix thrust cublas

kan*_*yin

2013 01-09

7
Score

1
Answers

4007
Views