Tag: matrix-multiplication

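Optimizing matrix multiplication inside a for loop in RcppArmadillo

The goal is to implement a fast version of orthogonal projective non-negative matrix factorization (opnmf) in R. I am translating the MATLAB code provided here.

I implemented a plain R version, but for a 20-factor solution on my data (~225000 x 150) it is much slower than the MATLAB implementation (about 5.5 times slower).

So I thought using C++ might speed things up, but it runs at about the same speed as the R version. I think it can be optimized, but I am not sure how, since I am new to C++. Here is a thread discussing a similar problem.

Here is my RcppArmadillo implementation.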

// [[Rcpp::export]]
Rcpp::List arma_opnmf(const arma::mat & X, const arma::mat & W0, double tol=0.00001, int maxiter=10000, double eps=1e-16) {
  arma::mat W = W0;
  arma::mat Wold = W;
  arma::mat XXW = X * (X.t()*W);
  double diffW = 9999999999.9;
  
  Rcout << "The value of maxiter : " << …


r matrix-multiplication nmf rcpparmadillo

Dat*_*'oh

2020 06-20

0
votes

1
answer

415
views

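How does torch.einsum perform this 4D tensor multiplication?

I came across code that uses torch.einsum to compute a tensor multiplication. I understand how it works for lower-order tensors, but not for 4D tensors, as shown below: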

import torch

a = torch.rand((3, 5, 2, 10))
b = torch.rand((3, 4, 2, 10))

c = torch.einsum('nxhd,nyhd->nhxy', [a,b])

print(c.size())

# output: torch.Size([3, 2, 5, 4])


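I need help with the following:

What operation is performed here (explain how the matrices are multiplied, transposed, etc.)?
Is torch.einsum actually beneficial in this case?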
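As a rough sketch of what the subscript string 'nxhd,nyhd->nhxy' computes, the snippet below uses NumPy's einsum (an assumption made so that the example runs without PyTorch; torch.einsum accepts the same subscript string). The helper name einsum_as_matmul is hypothetical. It spells the contraction out as two axis permutations followed by a batched matrix product over the shared d axis.

```python
import numpy as np

def einsum_as_matmul(a, b):
    """Hypothetical helper: 'nxhd,nyhd->nhxy' written as transposes plus matmul."""
    a_t = a.transpose(0, 2, 1, 3)  # (n, x, h, d) -> (n, h, x, d)
    b_t = b.transpose(0, 2, 3, 1)  # (n, y, h, d) -> (n, h, d, y)
    # Batched matrix product over the last two axes: (x, d) @ (d, y) -> (x, y)
    return a_t @ b_t

rng = np.random.default_rng(0)
a = rng.random((3, 5, 2, 10))
b = rng.random((3, 4, 2, 10))

c_einsum = np.einsum('nxhd,nyhd->nhxy', a, b)
c_matmul = einsum_as_matmul(a, b)

print(c_einsum.shape)
print(np.allclose(c_einsum, c_matmul))
```

In other words, for every batch index n and every h, an x-by-d matrix is multiplied by the transpose of a y-by-d matrix.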

python matrix-multiplication pytorch tensor

anu*_*rag

2023 01-08

0
votes

1
answer

2224
views

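If C is row-major, why does the ARM intrinsics code assume column-major order?

I am not sure where the best place to ask this is, but I am currently working with ARM intrinsics and following this guide: https://developer.arm.com/documentation/102467/0100/Matrix-multiplication-example

However, the code written there assumes that the arrays are stored in column-major order. I always thought C arrays were stored row-major. Why do they make this assumption?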
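To illustrate the indexing in question, here is a minimal Python sketch of a matrix product over flat buffers addressed in column-major order, where element (row, col) of a matrix with n rows lives at offset n*col + row. The function name matmul_col_major and the test sizes are hypothetical. NumPy's order='F' is used only to build and check the flat buffers.

```python
import numpy as np

def matmul_col_major(A, B, n, m, k):
    """Multiply an n x k matrix A by a k x m matrix B.

    Both inputs are flat column-major buffers; the n x m result is
    returned as a flat column-major buffer as well.
    """
    C = [0.0] * (n * m)
    for i in range(n):          # row of C
        for j in range(m):      # column of C
            for p in range(k):  # contraction index
                C[n * j + i] += A[n * p + i] * B[k * j + p]
    return C

n, m, k = 2, 3, 4
A2d = np.arange(n * k, dtype=float).reshape(n, k)
B2d = np.arange(k * m, dtype=float).reshape(k, m)

# Flatten in column-major ("Fortran") order to match the indexing above.
A_flat = A2d.flatten(order='F').tolist()
B_flat = B2d.flatten(order='F').tolist()

C_flat = matmul_col_major(A_flat, B_flat, n, m, k)
C2d = np.array(C_flat).reshape((n, m), order='F')
print(np.allclose(C2d, A2d @ B2d))
```

With a flat pointer, the storage order is a convention chosen by the indexing arithmetic rather than something the language imposes.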

Edit: for example, if instead of this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}


they had done this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx …


c optimization matrix-multiplication neon row-major-order

Ver*_*sed

2021 05-31

0
votes

1
answer

92
views

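Speeding up the computation of a symmetric matrix with OMP

My matrix computation is: C = C - A*B

Here C is a symmetric matrix, so I wanted to speed up the computation by evaluating only one triangle and then copying each element to its mirrored position. I used OMP and found that my implementation is slower than the plain computation of the whole matrix C.

I also see that computing C=C-AxB is slower than computing C=C+AxB.

My program is attached. Please advise!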

    Program testspeed
implicit none
integer nstate,nmeas,i,j,l
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
double precision, allocatable, dimension(:,:):: B,C,A
nstate =20000
nmeas=10000
allocate (B(nmeas,nstate),C(nstate,nstate),A(nstate,nmeas))
A=1d0
B=1d0
call system_clock(tclock1)
write(*,*) "1"
!$omp parallel do
do j = 1, nstate
    do l = 1,nmeas
        do i = 1, j
            C(j,i) = C(j,i) - A(j,l)*B(l,i)
            C(i,j)=C(j,i)
        end do
    end do
end do
!$omp end parallel do
write(*,*) "2" …
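The triangle-and-mirror idea described in the question can be sketched in Python with NumPy. This is only an illustration of the update pattern, not a translation of the Fortran program, and it assumes that both C and the product A*B are symmetric. The function name symmetric_update and the small test sizes are hypothetical.

```python
import numpy as np

def symmetric_update(C, A, B):
    """Return C - A @ B, computing only the lower triangle and mirroring it.

    Assumes C and A @ B are both symmetric.
    """
    n = C.shape[0]
    out = C.copy()
    for j in range(n):
        # Update row j up to and including the diagonal element.
        out[j, :j + 1] -= A[j, :] @ B[:, :j + 1]
        # Mirror the strictly lower part of row j into column j.
        out[:j, j] = out[j, :j]
    return out

rng = np.random.default_rng(1)
A = rng.random((6, 4))
B = A.T                  # makes A @ B symmetric
S = rng.random((6, 6))
C = S + S.T              # symmetric starting matrix

result = symmetric_update(C, A, B)
print(np.allclose(result, C - A @ B))
```

Each row update here is a single vector-matrix product over a contiguous slice, which keeps the inner work in one library call instead of element-by-element loops.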


fortran transpose openmp matrix-multiplication intel-fortran

nvh*_*h10

lucky-day

0
votes

1
answer

77
views

CUDA Tensor Cores: what is the role of NumBlocks and ThreadsPerBlock?

I would like to know what effect NumBlocks and ThreadsPerBlock have on this simple matrix multiplication routine

__global__ void wmma_matrix_mult(half *a, half *b, half *out) {

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, N);
   wmma::load_matrix_sync(b_frag, b, N);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}


Calling

`wmma_matrix_mult<<1, 1>>`: Incorrect
`wmma_matrix_mult<<1, 2>>`: …


cuda matrix-multiplication cuda-wmma

bin*_*Int

2024 01-09

0
votes

1
answer

346
views

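Returning a 2D array from a function in C

I pass a matrix into a function and want to output it multiplied by itself. I cannot manage to return the result in the correct format.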

int **multiplyMatrix(int matrixA[10][10], int matrixB[10][10], int n)
{
    int matrixC[10][10] = { 0 };
    int rowC, columnC = 0;
    int i = 0;
    int* ptr = matrixC;
    for (rowC = 0; rowC < n; rowC++)
    {
        for (columnC = 0; columnC < n; columnC++)
        {
            i = 0;
            for (i = 0; i < n; i++)
            {
                matrixC[rowC][columnC] += matrixA[rowC][i] * matrixB[i][columnC];
            }
        }
    }
    return *ptr;
}


The multiplication works fine, and if I print the matrix inside the function I get the correct result, but I cannot return the values to main()

void main()
{
    int n=2;
    int matrixA[10][10] = { …


c types pointers matrix-multiplication

Author

lucky-day

0
votes

1
answer

52
views

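Row-wise multiplication in R

I am trying to multiply two data frames that have the same number of columns but different numbers of rows. The idea is to multiply every row of dataset B with every row of dataset A. Dataset A: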

**Category  a1  a2  a3  a4  a5  a6  a7**
Food        10  15  28  30  60  33  35
Homecare    14  19  32  34  64  37  39
Apparel     17  22  35  37  67  40  42
Personal    30  35  48  50  80  53  55
AlcBever    33  38  51  53  83  56  58
Footwear    40  45  58  60  90  63  65
NonAlcBev   25  30  43  45  75  48  50


Dataset B:

    **Country   b1  b2  b3  b4  b5  b6  b7**
USA            0.5  0.3 0.1 0.4 0.7 …
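One way to picture "every row of B with every row of A" is a broadcasted product, sketched here in Python with NumPy rather than R. The sketch assumes that "multiply" means an element-wise product of each pair of rows. The array names are hypothetical, only the first three columns are used, and the second row of B is made up for illustration because the original table is cut off.

```python
import numpy as np

# First three columns of the Food and Homecare rows from dataset A.
A = np.array([[10.0, 15.0, 28.0],
              [14.0, 19.0, 32.0]])

# First row: the visible USA values. Second row: hypothetical filler.
B = np.array([[0.5, 0.3, 0.1],
              [1.0, 2.0, 0.5]])

# pairwise[b, a, :] is row b of B times row a of A, element by element.
pairwise = B[:, None, :] * A[None, :, :]

print(pairwise.shape)
print(pairwise[0, 1])  # USA row times Homecare row
```

The result is indexed as (row of B, row of A, column), so it can be reshaped into a long table with one row per pair if needed.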


row r matrix-multiplication

use*_*176

lucky-day

-2
votes

1
answer

194
views

Why doesn't this matrix multiplication code work

#include<stdio.h>
#include<conio.h>
int main()
{
int ar1[3][3] = {{1,0,0},{0,1,0},{0,0,1}};
int ar2[3][3] = {{1,2,3},{4,5,6},{7,8,9}};
int ar3[3][3];
int i,j,k;
for(i=0;i<3;i++)
{
    ar3[i][j] = 0;
    for(j=0;j<3;j++)
    {
        for(k=0;k<3;k++)
        {
            ar3[i][j] = ar3[i][j]+(ar1[i][k]*ar2[k][j]);
        }
    }
}
for(i=0;i<3;i++)
{
    for(j=0;j<3;j++);
    printf("%d\t",ar3[i][j]);
}
getch();
return 0;
}


When I compile the code in Dev C++ it gives no errors, but it fails to run and the application stops working. What is wrong with it?

c c++ arrays multidimensional-array matrix-multiplication

Sou*_*wal

2016 03-22

-2
votes

1
answer

135
views

How to compute matrix multiplication in C

I tried this code, but something is wrong

        for (i = 0; i < row1; i++) {
        for (j = 0; j < col2; j++)
            suma = 0;
            for (l = 0; l < row2; l++)
            suma += a[i][l] * bt[l][j];
            c[i][j] = suma;             
    }
    printf("\nMultiplication of 2 matrices:\n");
    for (i = 0; i < row1; i++) {
        for (j = 0; j < col2; j++)
            printf("%2d", c[i][j]);
        printf("\n");
    }


When I debug it, it prints random numbers in the rows and columns (something like -895473)

c matrix-multiplication

Author

lucky-day

-3
votes

1
answer

43
views

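CUDA out-of-memory error when doing matrix multiplication with Numba

I need to multiply a matrix by its transpose, but my GPU runs out of memory with the error message numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I expect the matrix to be about 10k rows by 100k columns, so multiplying it by its transpose gives a square matrix of 10k rows and 10k columns. The matrix contains only 0s and 1s.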
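A back-of-the-envelope check of the sizes involved may help here. The sketch below computes the raw storage needed for a 10k x 100k input under a few candidate element types. The element type actually used in the question is not known, so the dtypes listed are assumptions, and the helper name array_bytes is hypothetical.

```python
import numpy as np

def array_bytes(rows, cols, dtype):
    """Raw storage in bytes for a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize

rows, cols = 10_000, 100_000

# Input matrix under different (assumed) element types.
for dtype in (np.float64, np.float32, np.uint16, np.uint8):
    gib = array_bytes(rows, cols, dtype) / 2**30
    print(f"input as {np.dtype(dtype).name}: {gib:.2f} GiB")

# The 10k x 10k result, again with an assumed element type.
out_gib = array_bytes(rows, rows, np.float32) / 2**30
print(f"output as float32: {out_gib:.2f} GiB")
```

If the transpose is copied to the device as a separate array, the input footprint is paid twice, so the element type of a 0/1 matrix has a large effect on whether the allocation fits.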

Here is the script I am running.

from numba import cuda, uint16
import numba
import numpy
import math
import time

TPB = 16

@cuda.jit()
def matmul_shared_mem(A, B, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint16)
    sB = cuda.shared.array((TPB, TPB), dtype=uint16)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    if x >= C.shape[0] and y >= C.shape[1]:
        return
    tmp = 0.
    for i in range(int(A.shape[1] …


cuda matrix-multiplication pycuda numba

sec*_*ive

2021 04-23

-3
votes

1
answer

1542
views