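Tag: thrust

CUDA second-order recurrence with Thrust inclusive_scan

I am trying to understand how to parallelize a recurrence computation. Serially, the computation takes the following form: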

for (int i = 2; i<size; i++)
  {
    result[i] = oldArray[i] + k * result[i-2];
  }

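For the i-1 case, my previous question has a solution: CUDA force instruction execution order

I would like to modify it to use i-2, but I cannot see how to apply the same procedure to a second-order recurrence. It should be possible with the thrust::inclusive_scan function, but I do not know how. Does anyone know a solution?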
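
One possible way to look at it, sketched here in plain C++17 rather than CUDA: the i-2 recurrence never mixes even and odd indices, so it splits into two independent first-order recurrences, and each first-order step is an affine map y -> k*y + x whose composition is associative. The names Affine, ComposeAffine, firstOrderScan and secondOrderScan are made up for this illustration, result[0] and result[1] are assumed to be supplied as seeds, and std::inclusive_scan stands in for thrust::inclusive_scan, which also accepts a user-supplied binary operator.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Affine map y -> m * y + c. One recurrence step r = x + k * r_prev is {k, x}.
struct Affine {
    double m;
    double c;
};

// Composition "apply first, then second". Associative but not commutative.
struct ComposeAffine {
    Affine operator()(const Affine& first, const Affine& second) const {
        return {first.m * second.m, second.m * first.c + second.c};
    }
};

// Solves r[0] = x[0], r[j] = x[j] + k * r[j-1] with an inclusive scan.
std::vector<double> firstOrderScan(const std::vector<double>& x, double k) {
    std::vector<Affine> steps(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) steps[j] = {k, x[j]};
    if (!steps.empty()) steps[0] = {0.0, x[0]};  // seed step: always yields x[0]
    std::inclusive_scan(steps.begin(), steps.end(), steps.begin(), ComposeAffine{});
    std::vector<double> r(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) r[j] = steps[j].c;
    return r;
}

// Solves result[i] = oldArray[i] + k * result[i-2] for i >= 2.
// Assumption: seed0 and seed1 are the given values of result[0] and result[1].
std::vector<double> secondOrderScan(const std::vector<double>& oldArray, double k,
                                    double seed0, double seed1) {
    const std::size_t n = oldArray.size();
    std::vector<double> even, odd;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = (i == 0) ? seed0 : (i == 1) ? seed1 : oldArray[i];
        (i % 2 == 0 ? even : odd).push_back(x);  // de-interleave by parity
    }
    const std::vector<double> re = firstOrderScan(even, k);
    const std::vector<double> ro = firstOrderScan(odd, k);
    std::vector<double> result(n);
    for (std::size_t i = 0; i < n; ++i) result[i] = (i % 2 == 0) ? re[i / 2] : ro[i / 2];
    return result;
}
```

For oldArray = {1, 2, 3, 4, 5, 6, 7}, k = 2 and seeds 1 and 2, this gives {1, 2, 5, 8, 15, 22, 37}, the same values as the serial loop. On the GPU the de-interleaving copy could presumably be replaced by strided iterators, but that part is not shown here.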

cuda thrust

def*_*use

Votes: 0 | Answers: 1 | Views: 247

Maximum absolute difference using CUDA

We have the following serial C code operating on two vectors a[] and b[]:

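double a[20000],b[20000],r=0.9,errors=0;   // errors assumed to start at 0

for(int i=1;i<=10000;++i)
{
    a[i]=r*a[i]+(1-r)*b[i];               // blend a[i] towards b[i]
    errors=max(errors,fabs(a[i]-b[i]));   // track the largest absolute difference
    b[i]=a[i];
}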

Please tell us how to port this code to CUDA and cuBLAS.
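
Since every iteration of that loop touches only index i, it can be read as an element-wise transform followed by a max-reduction. The following is a minimal host-side sketch of that decomposition in plain C++17, not a CUDA or cuBLAS port; blendAndMaxError is a hypothetical name, the sketch processes the whole vector from index 0 rather than indices 1 to 10000, and it starts the running maximum at 0. Thrust offers thrust::transform and thrust::transform_reduce for the same pattern, though their argument lists differ from the std:: versions used here.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Applies the three loop statements as three data-parallel steps.
// Assumes a and b have the same length. Returns the largest |a[i] - b[i]|
// measured after a has been updated and before b is overwritten.
double blendAndMaxError(std::vector<double>& a, std::vector<double>& b, double r) {
    // Step 1: a[i] = r * a[i] + (1 - r) * b[i], independently for each i.
    std::transform(a.begin(), a.end(), b.begin(), a.begin(),
                   [r](double ai, double bi) { return r * ai + (1.0 - r) * bi; });

    // Step 2: max-reduction over |a[i] - b[i]|, starting from 0.
    const double errors = std::transform_reduce(
        a.begin(), a.end(), b.begin(), 0.0,
        [](double x, double y) { return std::max(x, y); },
        [](double ai, double bi) { return std::fabs(ai - bi); });

    // Step 3: b[i] = a[i].
    b = a;
    return errors;
}
```

With a = {1, 2, 4}, b = {1, 4, 0} and r = 0.5, the call returns 2 and leaves both vectors equal to {1, 3, 2}.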

c++ cuda thrust

K r*_*esh

2011 11-24

Votes: -1 | Answers: 1 | Views: 1107

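Why is CUDA Thrust reduction so slow?

I am learning CUDA. Today I tried some code from the book CUDA Application Design and Development, and the result surprised me. Why is CUDA Thrust so slow? Here are the code and the output.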

#include <iostream>
using namespace std;

#include<thrust/reduce.h>
#include<thrust/sequence.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include <device_launch_parameters.h>

#include "GpuTimer.h"

__global__ void fillKernel(int *a, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if(tid <n) a[tid] = tid;
}

void fill(int *d_a, int n)
{
    int nThreadsPerBlock = 512;
    int nBlock = n/nThreadsPerBlock + ((n%nThreadsPerBlock)?1:0);  // round up so a partial last block is covered
    fillKernel<<<nBlock, nThreadsPerBlock>>>(d_a, n);
}

int main()
{
    const int N = 500000;
    GpuTimer timer1, timer2;

    thrust::device_vector<int> a(N);

    fill(thrust::raw_pointer_cast(&a[0]), N);

    timer1.Start();
    int sumA = …
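
The block count computed in fill() is meant to be a rounded-up division of n by the block size. A minimal sketch of that calculation as a host-side helper, in plain C++ so it can be checked without a GPU; blocksFor is a hypothetical name and is not part of CUDA or of the book's code.

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n.
// Assumes n >= 0 and threadsPerBlock > 0.
constexpr int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For N = 500000 and 512 threads per block this gives 977 blocks, and for any n between 1 and 512 it gives one block.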

cuda thrust

hak*_*ami

Votes: -1 | Answers: 1 | Views: 1580

Why is Thrust so slow when finding the maximum and minimum of an array?

Here is the code that makes the call:

inline void find_min_max(thrust::device_vector<Npp8u> dev_vec, Npp8u *min, Npp8u *max){
    thrust::pair<thrust::device_vector<Npp8u>::iterator,thrust::device_vector<Npp8u>::iterator> tuple;
    tuple = thrust::minmax_element(dev_vec.begin(),dev_vec.end());
    *min = *(tuple.first);
    *max = *tuple.second;
}

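I also implemented the same algorithm with my own raw CUDA kernel, using the map-reduce paradigm, and with simple CPU code. According to my measurements, Thrust is the slowest.

For the measurements I use events for both the raw CUDA and the Thrust code. If events are suitable for benchmarking Thrust, I am fairly sure I am measuring the execution time correctly.

Here is the measurement part: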

    ....
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    thrust::device_vector<Npp8u> image_dev(imageHost, imageHost+N);

    // Device vector allocation
    find_min_max(image_dev,&min,&max);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime1;
    cudaEventElapsedTime(&elapsedTime1, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    totalTime1 = elapsedTime1/1000;  // cudaEventElapsedTime reports milliseconds
....

My real question is: is there a better approach than the plain minmax_element function in Thrust?
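
Two things are visible in the posted code before looking for alternatives: find_min_max takes the device_vector by value, so the vector is copied on each call, and the timed region also includes constructing image_dev from host memory, so the measurement covers more than the reduction itself. As for the algorithm, one common shape is a single reduction that carries a (min, max) pair. The sketch below shows that shape on the host in plain C++17; MinMax and findMinMax are hypothetical names, std::uint8_t stands in for Npp8u, and no claim is made that this is faster than thrust::minmax_element.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Running (min, max) pair carried through the reduction.
struct MinMax {
    std::uint8_t min;
    std::uint8_t max;
};

// Single-pass min/max: each element becomes {x, x}, and pairs are merged
// by taking the smaller min and the larger max.
// Assumes v is non-empty. Takes the vector by const reference to avoid a copy.
MinMax findMinMax(const std::vector<std::uint8_t>& v) {
    const MinMax init{v.front(), v.front()};
    return std::transform_reduce(
        v.begin(), v.end(), init,
        [](MinMax a, MinMax b) {
            return MinMax{std::min(a.min, b.min), std::max(a.max, b.max)};
        },
        [](std::uint8_t x) { return MinMax{x, x}; });
}
```

For the input {5, 3, 200, 7} the result has min 3 and max 200.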

My machine specs: an ASUS K55V laptop with a GeForce 635M and an i7 processor.

Also, the complete Thrust code and CPU code.

cuda thrust

ero*_*gol

2013 05-26

Votes: -3 | Answers: 1 | Views: 2908

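Statistics with Thrust: compile errors

I want to compute the mean and standard deviation with Thrust, and I found this code. I am trying to use it with complex values and have run into some problems.

Here is the code: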

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/extrema.h>
#include <cmath>
#include <float.h>

typedef struct
{
    float re,im;
} mycomplex;


// structure used to accumulate the moments and other
// statistical properties encountered so far.
template <typename T>
struct summary_stats_data
{
    T n;
    T min;
    T max;
    T mean;
    T M2;

    // initialize to the identity element
    void initialize()
    {
        n.re = mean.re = M2.re = 0;
        n.im = mean.im = M2.im = 0;
        min …

cuda thrust

Geo*_*rge

2014 12-19

Votes: -3 | Answers: 1 | Views: 265

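Is Thrust slower than my own kernel?

EDIT

I changed the code following Robert's suggestion, but Thrust is still much slower.

The data I use comes from two .dat files, so I have omitted it from the code.

Original question

I have two complex vectors that are already on a Tesla M6 GPU. I want to compute the element-wise product of the two vectors, i.e. [x1*y1,...,xN*yN]. Both vectors have length N = 720,896.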
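
For reference, the element-wise product itself can be sketched on the host in a few lines of plain C++; hadamardProduct is a hypothetical name, std::complex<float> stands in for the GPU complex types, and the inputs are assumed to have equal length. thrust::transform has a binary overload with the same (first1, last1, first2, result, op) shape.

```cpp
#include <algorithm>
#include <complex>
#include <functional>
#include <vector>

using Complex = std::complex<float>;

// Returns [x1*y1, ..., xN*yN]. Assumes x and y have the same length.
std::vector<Complex> hadamardProduct(const std::vector<Complex>& x,
                                     const std::vector<Complex>& y) {
    std::vector<Complex> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   std::multiplies<Complex>{});
    return out;
}
```

For example, (1+2i)*(3+4i) comes out as -5+10i.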

Code snippet (modified)

I solve this problem in two ways. One uses Thrust with a type cast and a dedicated functor struct:

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

#include "cuda_runtime.h"
#include "cuComplex.h"

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/complex.h>
#include <thrust/transform.h>
#include <thrust/functional.h>


using namespace std;

typedef thrust::complex<float> comThr;

// ---- struct for thrust ----//
struct Complex_Mul_Complex :public thrust::binary_function<comThr, comThr, comThr>
{
    __host__ __device__
    comThr operator() (comThr a, comThr b) const{
        return a*b;
    }
};

// ---- my kernel function ---- //
__global__ void HardamarProductOnDeviceCC(cuComplex …

c++ cuda thrust

Der*_*kLu

2019 05-19

Votes: -5 | Answers: 1 | Views: 613