CUBLAS memory allocation error

I am trying to allocate 17338896 float elements (about 70 MB) as follows:

    state = cublasAlloc(theSim->Ndim*theSim->Ndim, 
                       sizeof(*(theSim->K0)), 
                       (void**)&K0cuda);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocation video memory.\n");
        return -1;
    }

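However, the variable state comes back with the error CUBLAS_STATUS_ALLOC_FAILED. Does this have to do with the amount of video memory available on the machine (128 MB on mine), or is it a limit on how much memory I can allocate with the cublasAlloc() function (i.e. unrelated to the amount of memory available on the machine)? I tried the cudaMalloc() function and ran into the same problem. Thanks in advance for looking into this.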
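To put the numbers in the question side by side, here is a small host-only sketch of the size arithmetic. The helper names `allocationBytes` and `toMiB` are made up for illustration, and a 4-byte `float` is assumed. The request comes to roughly 66 MiB, about half of a 128 MB card, and on a card that also drives a display part of that memory is typically already in use, so the free amount may well be smaller than the request. The CUDA runtime call `cudaMemGetInfo` can report the free and total amounts on the actual device.

```cpp
#include <cstdint>

// Bytes needed for `count` elements of `elemSize` bytes each.
// 64-bit arithmetic keeps large element counts from overflowing.
constexpr std::uint64_t allocationBytes(std::uint64_t count, std::uint64_t elemSize) {
    return count * elemSize;
}

// Convert a byte count to MiB (2^20 bytes).
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Element count from the question: 17338896 == 4164 * 4164 (Ndim * Ndim).
constexpr std::uint64_t kElements = 4164ULL * 4164ULL;

// Size of the requested buffer, assuming 4-byte floats.
constexpr std::uint64_t kRequestBytes = allocationBytes(kElements, 4);
```

Here `kRequestBytes` is 69355584 bytes, and `toMiB(kRequestBytes)` is a little over 66.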

-------------- Added error reproduction --------------

#include <cuda.h>
#include <cublas.h>  // legacy cuBLAS API: cublasInit, cublasAlloc, cublasStatus
#include <stdio.h>
int main (int argc, char *argv[]) {

    // CUDA setup
    cublasStatus state;

    if(cublasInit() == CUBLAS_STATUS_NOT_INITIALIZED) {
        printf("CUBLAS init error.\n");
        return -1;
    }

    // Instantiate video memory pointers
    float *K0cuda;

    // Allocate video memory needed
    state = cublasAlloc(20000000, 
                        sizeof(float), 
                        (void**)&K0cuda);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocation video memory.\n");
        return -1;
    }

    // Copy …

c++ memory-management cuda cublas

sta*_*tor

2016 01-02

1
votes

1
answers

1444
views

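cublas: failed to synchronize on the stop event?

I am working with the matrixMulCUBLAS sample code and tried changing the default matrix sizes to something slightly more interesting, rows=5k x cols=2.5k. The sample then fails with Failed to synchronize on the stop event (error code unknown error)! at line #377, once all the computation is done and it is apparently cleaning up cublas. What does this mean, and how can it be fixed?

I have CUDA 5.0 installed and an EVGA FTW nVidia GeForce GTX 670 with 2 GB of memory. The driver is the latest version as of now, 314.22.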

cuda gpu cublas

Sky*_*ker

1
votes

1
answers

2300
views

Why does CUBLAS use const pointers for parameters?

For example,

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                       cublasOperation_t transa, cublasOperation_t transb,
                       int m, int n, int k,
                       const float           *alpha,
                       const float           *A, int lda,
                       const float           *B, int ldb,
                       const float           *beta,
                       float           *C, int ldc)

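This raises several points of confusion:

What does the const achieve?
Why do we have to supply pointers for the scalar parameters?
How does this relate to CUBLAS_POINTER_MODE_HOST?
Do we need to explicitly create temporary const variables to pass them in, or will ordinary pointers do?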
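On the last point, the language rule can be shown without cuBLAS at all. In C and C++ a `float*` converts implicitly to `const float*`, and `const` on a pointer parameter only says that the callee will not write through it. The sketch below uses a hypothetical host function, `scaleAdd`, whose scalar and array parameters merely have the same shape as those of `cublasSgemm`; it is not a cuBLAS routine.

```cpp
// Hypothetical stand-in with the same pointer-parameter shape as cublasSgemm's
// alpha/beta and input/output arguments. It is NOT part of cuBLAS.
// Computes y[i] = (*alpha) * x[i] + (*beta) * y[i] on the host.
void scaleAdd(int n, const float* alpha, const float* x, const float* beta, float* y) {
    for (int i = 0; i < n; ++i) {
        y[i] = (*alpha) * x[i] + (*beta) * y[i];
    }
}

// Calls scaleAdd with ordinary, non-const variables and returns the sum of y.
float demoConstPointers() {
    float alpha = 2.0f;                     // plain non-const scalars
    float beta = 1.0f;
    float x[3] = {1.0f, 2.0f, 3.0f};
    float y[3] = {10.0f, 20.0f, 30.0f};
    // &alpha has type float*; it converts to const float* implicitly,
    // so no const temporaries are required at the call site.
    scaleAdd(3, &alpha, x, &beta, y);
    return y[0] + y[1] + y[2];              // 12 + 24 + 36
}
```

The same conversion applies when the arguments are passed to a real library function declared with `const float*` parameters.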

CUBLAS Library

pointers cuda gpgpu const cublas

mch*_*hen

2013 05-04

1
votes

1
answers

731
views

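Finding the max/min in CUDA without passing it to the CPU

I need to find the index of the largest element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, which slows down the application's running time.

Is there a way to compute this index efficiently and keep it on the GPU?

Thanks!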

parallel-processing cuda nvidia cublas

rod*_*dms

1
votes

1
answers

717
views

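Mixing Thrust and cuBLAS gives unexpected output

I like the Thrust library, especially how nicely it hides the complexity of cudaMalloc, cudaFree and so on.

I want to sum all the columns of a matrix, so I used cuBLAS's "cublasSgemv" and multiplied my matrix by a vector. Here is my code: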

void sEarColSum(std::vector<float>& inMatrix, int colSize)
{
    cublasHandle_t handle; // CUBLAS context
    float al = 1.0f; // al =1
    float bet = 1.0f; // bet =1
    int rowSize = inMatrix.size() / colSize;

    float *devOutputPtr = thrust::raw_pointer_cast(thrust::device_malloc<float>(colSize));

    thrust::device_vector<float> deviceT2DMatrix(inMatrix.begin(), inMatrix.end());
    float* device2DMatrixPtr = thrust::raw_pointer_cast(deviceT2DMatrix.data());

    thrust::device_vector<float> deviceVector(rowSize, 1.0f);
    float* deviceVecPtr = thrust::raw_pointer_cast(deviceVector.data());

    cublasCreate(&handle);
    cublasSgemv(handle, CUBLAS_OP_N, colSize, rowSize, &al, device2DMatrixPtr, colSize, deviceVecPtr, 1, &bet, devOutputPtr, 1);

    std::vector<float> outputVec(colSize);
    cudaMemcpy(outputVec.data(), devOutputPtr, outputVec.size() * sizeof(float), cudaMemcpyDeviceToHost);

    for (auto elem : …
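
The listing above is cut off, so the cause of the unexpected output cannot be confirmed from it. One thing that may be worth checking is the role of `bet`: gemv computes y = alpha * A * x + beta * y, so with beta equal to 1 whatever the output buffer held beforehand is added to the result, and storage obtained from `thrust::device_malloc` is not initialised. The host-only sketch below mirrors that formula for a column-major matrix; `gemvColumnMajor` and `sumWithBeta` are illustrative names, not cuBLAS or Thrust functions.

```cpp
#include <vector>

// Host reference for the non-transposed gemv operation:
// y = alpha * A * x + beta * y, with A stored column-major
// (m rows, n columns, leading dimension lda).
void gemvColumnMajor(int m, int n, float alpha, const std::vector<float>& A, int lda,
                     const std::vector<float>& x, float beta, std::vector<float>& y) {
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) {
            acc += A[j * lda + i] * x[j];   // element (i, j) of a column-major matrix
        }
        y[i] = alpha * acc + beta * y[i];
    }
}

// Multiplies a fixed 2 x 3 matrix by a vector of ones, starting from an output
// vector filled with `initialY`, to show how beta folds old contents into y.
std::vector<float> sumWithBeta(float beta, float initialY) {
    // Column-major storage: columns are (1,2), (3,4), (5,6).
    std::vector<float> A = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    std::vector<float> ones(3, 1.0f);
    std::vector<float> y(2, initialY);
    gemvColumnMajor(2, 3, 1.0f, A, 2, ones, beta, y);
    return y;
}
```

With `sumWithBeta(0.0f, 100.0f)` the result is {9, 12}, while `sumWithBeta(1.0f, 100.0f)` gives {109, 112}: the stale value 100 leaks into every entry.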

c++ cuda thrust cublas

Kad*_*mir

1
votes

1
answers

667
views

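CUBLAS matrix multiplication with row-major data without transpose

I am currently trying to implement matrix multiplication on my GPU using CUBLAS.

It works for square matrices and for certain input sizes, but for others the last row is not returned (and contains zeros, because of the way I implemented it).

I assume this is a problem with the allocation or with the syntax of cublasSgemm, but I cannot find where it is.

Note: in case you are not familiar with CUBLAS, it is column-major, which is why the operations look as if they are performed the other way around.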
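
The column-major note can be made concrete with a host-only sketch. A row-major buffer reads as the transposed matrix when interpreted column-major, so a row-major product C = A * B can be obtained from a column-major GEMM by swapping the operands and computing C^T = B^T * A^T, with the leading dimensions set to the row lengths of the row-major matrices. `gemmColumnMajor` below is a plain reference loop standing in for a non-transposed GEMM call, and `multiplyRowMajor` is an illustrative name; neither is the asker's code. Non-square sizes are where a wrong m, n, k or leading dimension tends to show up.

```cpp
#include <vector>

// Host reference for a column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), with leading dimensions lda, ldb, ldc.
void gemmColumnMajor(int m, int n, int k,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[p * lda + i] * B[j * ldb + p];
            }
            C[j * ldc + i] = acc;
        }
    }
}

// Row-major product C = A * B, with A of size rowsA x colsA and B of size
// colsA x colsB. The operands are swapped so that the column-major routine
// computes C^T = B^T * A^T, which is exactly C laid out row-major.
std::vector<float> multiplyRowMajor(const std::vector<float>& A, int rowsA, int colsA,
                                    const std::vector<float>& B, int colsB) {
    std::vector<float> C(static_cast<std::size_t>(rowsA) * colsB);
    gemmColumnMajor(colsB, rowsA, colsA,   // m, n, k of the transposed problem
                    B.data(), colsB,       // B^T is colsB x colsA, ld = colsB
                    A.data(), colsA,       // A^T is colsA x rowsA, ld = colsA
                    C.data(), colsB);      // C^T is colsB x rowsA, ld = colsB
    return C;
}
```

For a 2 x 3 matrix {1,2,3,4,5,6} times a 3 x 2 matrix {7,8,9,10,11,12}, both row-major, this returns {58, 64, 139, 154}.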

Any help would be appreciated.

Code:

Note that gpuErrchk and cublasErrchk are of course not relevant here.

#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#include <vector>

std::vector<float> CUDA_mult_MAT(const std::vector<float> &data_1 , const uint64_t data_1_rows, const uint64_t data_1_columns,
                                 const std::vector<float> &data_2 , const uint64_t data_2_rows, const uint64_t data_2_columns){

    cublasHandle_t handle;

    cublasErrchk(cublasCreate(&handle));

    std::vector<float> result(data_1_rows * data_2_columns); //Vector holding the result of the multiplication

    /*----------------------------------------------------------------------------------------------*/

    float* GPU_data_1 = NULL;
    gpuErrchk(cudaMalloc((void**)&GPU_data_1 , data_1.size()*sizeof(float))); //Allocate memory on the GPU
    gpuErrchk(cudaMemcpy(GPU_data_1, …

c++ cuda cublas

Ere*_*rel

2020 11-16

1
votes

1
answers

133
views

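thrust::max_element slow compared to cublasIsamax - more efficient implementation?

I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed many times. I originally used cublasIsamax; unfortunately, it returns the index of the maximum absolute value, which is not what I want. Instead, I am now using thrust::max_element, but it is rather slow compared to cublasIsamax. I use it in the following way: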

//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;

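The number of elements in the vector ranges between 10,000 and 20,000. The difference in speed between thrust::max_element and cublasIsamax is rather large. Perhaps I am performing several memory transactions without knowing it?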
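
The difference between the two searches mentioned in the question can be seen on the host with the standard library. The sketch below is only an illustration of the semantics: `indexOfMax` corresponds to what a max_element search locates, and `indexOfMaxAbs` to what the BLAS amax family locates. Both names are made up here, and both return 0-based indices, whereas cublasIsamax reports a 1-based index.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// 0-based index of the largest value (the max_element semantics).
int indexOfMax(const std::vector<float>& v) {
    return static_cast<int>(std::max_element(v.begin(), v.end()) - v.begin());
}

// 0-based index of the largest absolute value (the BLAS amax semantics).
int indexOfMaxAbs(const std::vector<float>& v) {
    auto it = std::max_element(v.begin(), v.end(),
                               [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return static_cast<int>(it - v.begin());
}
```

For the values {1, -7, 3, 5}, `indexOfMax` returns 3 while `indexOfMaxAbs` returns 1, so the two only agree when the largest value also has the largest magnitude.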

c++ performance cuda thrust cublas

spu*_*rra

0
votes

1
answers

1621
views

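How to perform a Hadamard product on complex numbers with CUBLAS?

I need to compute the element-wise multiplication of two vectors of complex numbers (the Hadamard product) with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to do it with CUBLAS for complex numbers?

I cannot write my own kernel; I have to use CUBLAS (or another standard NVIDIA library, if it really is not possible with CUBLAS).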
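
The SBMV approach mentioned above works because a band matrix with zero off-diagonals is a diagonal matrix, and multiplying diag(x) by a vector a gives the element-wise product. The host-only sketch below spells out that identity for complex values with `std::complex`; the function name `hadamard` is made up for illustration, and this is a reference for the arithmetic, not a cuBLAS call. As far as I know, the cuBLAS dgmm extension routines (for example cublasCdgmm) multiply a matrix by a diagonal matrix and accept complex types, so they may be a candidate worth checking against the documentation.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Element-wise (Hadamard) product of two complex vectors of equal length,
// i.e. the product diag(x) * a written out entry by entry.
std::vector<std::complex<float>> hadamard(const std::vector<std::complex<float>>& x,
                                          const std::vector<std::complex<float>>& a) {
    std::vector<std::complex<float>> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = x[i] * a[i];   // entry i of diag(x) * a
    }
    return out;
}
```

For x = {1+2i, i} and a = {3+4i, 2} the result is {-5+10i, 2i}.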

cuda gpu nvidia cublas

Bap*_*cht

0
votes

1
answers

1129
views

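Using cuBLAS with complex numbers from Thrust

In my code I use arrays of complex numbers from the Thrust library, and I would like to use cublasZgeam() to transpose the array.

Using the complex numbers from cuComplex.h is not a preferable option, since I do a lot of arithmetic on the array and cuComplex does not define operators such as * and +=.

This is how I define the array I want to transpose: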

thrust::complex<float> u[xmax][xmax];

I found https://github.com/jtravs/cuda_complex, but using it like this:

#include "cuComplex.hpp"

does not let me use the operators mentioned when compiling with nvcc:

error: no operator "+=" matches these operands
        operand types are: cuComplex += cuComplex

Is there a solution to this? The code on GitHub is old, which may be the issue, or maybe I am using it wrong.
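
One commonly used route is to keep the data in a complex class that has operators and hand the same memory to the library as interleaved (real, imaginary) pairs. That relies on the two types having the same size and layout. The host-only sketch below checks this for `std::complex<double>` against a hypothetical struct `Double2` that only imitates the shape of an interleaved type such as cuDoubleComplex; it is not the real CUDA type, and the assumption that thrust::complex follows the same layout should be verified against the Thrust documentation.

```cpp
#include <complex>
#include <cstring>

// Hypothetical stand-in for an interleaved (re, im) struct such as
// cuDoubleComplex. It is NOT the real CUDA type.
struct Double2 {
    double x;  // real part
    double y;  // imaginary part
};

// The reinterpretation only makes sense if both types occupy the same bytes.
static_assert(sizeof(std::complex<double>) == sizeof(Double2),
              "complex<double> and the interleaved struct must have equal size");

// Copies the bytes of a complex value into the interleaved struct.
// memcpy avoids aliasing problems that a pointer cast could introduce on the host.
Double2 toInterleaved(const std::complex<double>& z) {
    Double2 out;
    std::memcpy(&out, &z, sizeof(out));
    return out;
}
```

For z = 3.3 - 1.5i, `toInterleaved(z)` yields x == 3.3 and y == -1.5, which is the ordering an interleaved complex array expects.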

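EDIT: Here is the code that works. The only differences from talonmies' code are the addition of a simple kernel and a pointer to the same data, but of type thrust::complex.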

#include <iostream>
#include <thrust/fill.h>
#include <thrust/complex.h>
#include <cublas_v2.h>

using namespace std;

__global__ void test(thrust::complex<double>* u) {

  u[0] += thrust::complex<double>(3.3,3.3);
}

int main()
{
  int xmax = 100;
  thrust::complex<double>  u[xmax][xmax];
  double arrSize = sizeof(thrust::complex<double>) * xmax * xmax;

  thrust::fill(&u[0][0], &u[0][0] + (xmax * xmax), …

c++ cuda thrust cublas

Max*_*x K

2017 04-18

0
votes

1
answers

2887
views

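cublasGemmEx result is always zero

I am trying to do a matrix multiplication with cublasGemmEx. A and B are 1x1 half matrices. If I set the compute type and the output data type to CUDA_R_16F, the result is always zero. If I set the compute type and the output data type to CUDA_R_32F, the result is correct.

Does anyone know why the result is zero when I set the types to CUDA_R_16F? Thanks in advance for any reply.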
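
The listing in this question is cut off before the cublasGemmEx call, so the following may not apply to it. One cause worth ruling out: when the compute type is CUDA_R_16F, the alpha and beta scalars are expected in half precision as well, and if they are passed as `float` the routine would interpret only the first two bytes of each. The host-only sketch below shows what those two bytes look like for typical scalar values. It assumes a little-endian machine with IEEE-754 floats, and `firstTwoBytes` is an illustrative name.

```cpp
#include <cstdint>
#include <cstring>

// Returns the first two bytes of a float's storage as a 16-bit pattern,
// i.e. the bits a reader would see if it treated the float as a 16-bit value.
// Assumes a little-endian machine with IEEE-754 single precision.
std::uint16_t firstTwoBytes(float value) {
    std::uint16_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}
```

On such a machine `firstTwoBytes(1.0f)` is 0x0000, since 1.0f is stored as 0x3F800000, and the 16-bit pattern 0x0000 is +0.0 in half precision. An alpha read that way would scale the product to zero.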

My CUDA version is 10.2 and the GPU is a T4. I build the following code with the command 'nvcc -arch=sm_75 test_cublas.cu -o test_cublas -lcublas':

#include "cublas_v2.h"
#include "library_types.h"
#include <stdio.h>

__global__ void init_kernel(half *a, half *b, half *c_half, float *c_float)
{
    *a = __float2half_rn(1.0);
    *b = __float2half_rn(1.5);
    *c_half = __float2half_rn(0.0);
    *c_float = 0.0;
}

__global__ void print_gpu_values(half *a, half *b, half *c_half, float *c_float)
{
    printf("a %f, b %f, c_half %f, c_float %f\n", __half2float(*a), __half2float(*b), __half2float(*c_half), *c_float);
}

int main(int argc, char **argv)
{
    cudaStream_t cudaStream;
    if (cudaSuccess …

c++ cuda cublas

xkc*_*kcd

0
votes

1
answers

840
views