Posts by Der*_*kLu

Why are kernel executions in different streams not parallel?

I just learned about streams in CUDA and tried them out. However, I got an unwanted result: the streams do not run in parallel. (GPU: Tesla M6, OS: Red Hat Enterprise Linux 8.)

I have a data matrix of size (5, 2048) and a kernel that processes the matrix.

My plan is to split the data into "nStreams = 4" sections and use 4 streams to execute the kernel in parallel.

Part of my code is shown below:

int rows = 5;
int cols = 2048;

int blockSize = 32;
int gridSize = (rows*cols) / blockSize;
dim3 block(blockSize);
dim3 grid(gridSize);

int nStreams = 4;    // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
    checkCudaErrors(cudaStreamCreate(&streams[ii]));
}

int streamSize = rows * cols / nStreams;
dim3 streamGrid = streamSize/blockSize;

for(int jj=0;jj<nStreams;jj++){
    int offset = jj * streamSize;
    Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
}    // d_Data is …
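To make the partitioning arithmetic above easier to check, here is a minimal host-only sketch (no CUDA calls) that computes the offset, element count, and grid size for each stream. `StreamChunk` and `planChunks` are hypothetical names that do not appear in the post, and the rounded-up division is an assumption added so that sizes that do not divide evenly are still covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one contiguous slice of the flat array,
// intended to be processed by a single stream.
struct StreamChunk {
    std::size_t offset;  // index of the first element in this slice
    std::size_t count;   // number of elements in this slice
    std::size_t blocks;  // thread blocks needed to cover `count` elements
};

// Split `total` elements into `nStreams` contiguous slices and compute the
// per-slice grid size. Both divisions round up, so a trailing partial slice
// or partial block is not dropped.
std::vector<StreamChunk> planChunks(std::size_t total,
                                    std::size_t nStreams,
                                    std::size_t blockSize) {
    assert(nStreams > 0 && blockSize > 0);
    const std::size_t perStream = (total + nStreams - 1) / nStreams;
    std::vector<StreamChunk> chunks;
    chunks.reserve(nStreams);
    for (std::size_t s = 0; s < nStreams; ++s) {
        const std::size_t offset = s * perStream;
        // Slices past the end of the data get a count of zero.
        const std::size_t count =
            offset >= total ? 0 : std::min(perStream, total - offset);
        const std::size_t blocks = (count + blockSize - 1) / blockSize;
        chunks.push_back({offset, count, blocks});
    }
    return chunks;
}
```

With the values from the question (5 * 2048 elements, 4 streams, block size 32), every slice holds 2560 elements and needs 80 blocks of 32 threads, which matches `streamSize / blockSize` in the snippet.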


c++ cuda gpu-programming

Der*_*kLu

2019-04-29

0
Score

1
Answers

114
Views

Is using Thrust slower than my own kernel?

Edit

I changed the code as Robert suggested, but Thrust is still much slower.

The data I use comes from two .dat files, so I have omitted it from the code.

Original problem

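I have two complex vectors that are already on the GPU (Tesla M6). I want to compute the element-wise product of the two vectors, i.e. [x1*y1, ..., xN*yN]. Both vectors have length N = 720,896.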
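For reference, the element-wise product described here can be written on the host in a few lines. This is only a CPU sketch using `std::complex<float>`, not the Thrust or kernel code from the post; `Complex` and `hadamardProductHost` are hypothetical names. Something like it could serve as a baseline for checking GPU results.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// Element-wise (Hadamard) product on the host: out[i] = x[i] * y[i].
// Both inputs must have the same length.
std::vector<Complex> hadamardProductHost(const std::vector<Complex>& x,
                                         const std::vector<Complex>& y) {
    assert(x.size() == y.size());
    std::vector<Complex> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Complex multiply: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        out[i] = x[i] * y[i];
    }
    return out;
}
```

For example, multiplying (1 + 2i) by (3 + 4i) gives (-5 + 10i), which is what each output element should hold for the corresponding input pair.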

Code snippet (modified)

I solve this problem in two ways. One uses Thrust with type casting and a specific struct:

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

#include "cuda_runtime.h"
#include "cuComplex.h"

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/complex.h>
#include <thrust/transform.h>
#include <thrust/functional.h>


using namespace std;

typedef thrust::complex<float> comThr;

// ---- struct for thrust ----//
struct Complex_Mul_Complex :public thrust::binary_function<comThr, comThr, comThr>
{
    __host__ __device__
    comThr operator() (comThr a, comThr b) const{
        return a*b;
    }
};

// ---- my kernel function ---- //
__global__ void HardamarProductOnDeviceCC(cuComplex …


c++ cuda thrust

Der*_*kLu

2019-05-19

-5
Score

1
Answers

613
Views