Dimension mismatch error when using patch embedding for video processing

dtr*_*r43 4 python pytorch einops

I am working with a transformer model proposed for video classification. My input tensor has shape [batch=16, channels=3, frames=16, H=224, W=224], and to apply patch embedding to it, the model uses the following scheme:

patch_dim = in_channels * patch_size ** 2
self.to_patch_embedding = nn.Sequential(
    Rearrange('b t c (h p1) (w p2) -> b t (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    nn.Linear(patch_dim, dim),  # <-- root of the error
)

The parameters I am using are as follows:

patch_size = 16
dim = 192
in_channels = 3

Unfortunately, I get the following error, which corresponds to the line marked in the code above:

Exception has occurred: RuntimeError
mat1 and mat2 shapes cannot be multiplied (9408x4096 and 768x192)

I have thought a lot about the cause of this error, but I cannot figure out what it is. How can I fix this?

Mar*_*kus 5

Your input tensor has shape [batch=16, channels=3, frames=16, H=224, W=224], while the Rearrange pattern expects the dimensions in the order [b t c h w]. Where the pattern expects channels, you are passing frames. As a result, the last dimension comes out as (p1 * p2 * c) = 16 * 16 * 16 = 4096 instead of the 16 * 16 * 3 = 768 that nn.Linear(patch_dim, dim) expects as its input size.

Try aligning the positions of the channels and frames dimensions:

import torch
from torch import nn
from einops.layers.torch import Rearrange

patch_size = 16
dim = 192

b, f, c, h, w = 16, 16, 3, 224, 224
input_tensor = torch.randn(b, f, c, h, w)

patch_dim = c * patch_size ** 2

m = nn.Sequential(
    Rearrange('b t c (h p1) (w p2) -> b t (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    nn.Linear(patch_dim, dim)
)

print(m(input_tensor).size())

Output:

torch.Size([16, 16, 196, 192])