How to vectorize indexing and computation when the indexed tensors are different dimensions?

bei*_*ner 6 python vectorization pytorch

I'm trying to vectorize the following for loop in PyTorch. I'd be happy just vectorizing the inner for loop, but vectorizing over the whole batch would be even better.

# B: the batch size
# N: the number of training examples 
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss

batch_data = torch.tensor(...)  # Tensor of shape [B x N x dim]
batch_labels = torch.tensor(...)  # Tensor of shape [B x N x 1], each element is one of K labels (ints)

batch_losses = []  # Ultimately should be [B x 1]
batch_centroids = []  # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i]                   # [N x dim]
    labels = batch_labels[i].squeeze(-1)   # [N]

    centroids = []  # Keep track of the means for each class.
    classes = torch.unique(labels)  # Get the unique labels for the classes.

    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.

    total_loss = 0

    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[labels == cl]

        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        centroids.append(centroid)
  
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)

        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2

        total_loss += hinge.mean()

    # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
    loss = total_loss / len(classes)  # average the per-class losses
    batch_losses.append(loss)
    batch_centroids.append(centroids)
   
   

I've been racking my brain over how to handle ragged-sized tensors: the number of classes K_i differs for each item in the batch, and each subset is a different size as well.

VF1*_*VF1 3

It turns out that you actually can vectorize across ragged arrays. I'll use numpy, but the code should translate directly to torch. The key techniques, illustrated with a small worked example after the list, are:

  1. Sort by ragged-array membership
  2. Compute cumulative sums
  3. Find boundary indices and take adjacent differences
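
For concreteness, here is a minimal sketch of those three steps on a made-up 1-D array (the values and group boundaries are purely illustrative):

import numpy as np

x = np.array([5., 1., 4., 2., 3.])
label = np.array([1, 0, 1, 0, 1])

# 1. Sort so the members of each ragged group are contiguous.
ix = np.argsort(label)
xz = x[ix]                       # [1., 2., 5., 4., 3.]

# 2. Cumulative sum with a leading zero, so any span's sum is a difference.
csum = np.concatenate(([0.], np.cumsum(xz)))

# 3. Boundary indices of each group; adjacent differences give group sums.
pos = np.array([0, 2, 5])        # group 0 spans [0, 2), group 1 spans [2, 5)
group_sums = np.diff(csum[pos])  # [3., 12.]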

For a single (non-batched) input of an n x d matrix X and an n-length label array label, the following returns the k x d centroids and the n-length squared distances from each point to its respective centroid:

import numpy as np

def inverse_permutation(p):
    """Return the permutation s such that s[p] == arange(len(p))."""
    s = np.empty_like(p)
    s[p] = np.arange(len(p))
    return s

def vcentroids(X, label):
    """
    Vectorized version of centroids.
    """
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]

    # compute pos where pos[i]:pos[i+1] is span of cluster i
    d = np.diff(label, prepend=0)  # nonzero where the label changes; value = size of the jump
    pos = np.flatnonzero(d)        # indices where labels change
    pos = np.repeat(pos, d[pos])   # repeat boundaries to account for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))

    # prepend a zero row so every cluster sum is a difference of cumulative sums
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)
    counts = np.diff(pos)
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]  # guard against empty clusters

    # broadcast each centroid back to its members and undo the sort
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[inverse_permutation(ix)]
    dist = np.sum((X - aligned_centroids) ** 2, axis=1)  # squared distances

    return c, dist
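As a quick sanity check, a hypothetical usage with made-up data (shapes and values are illustrative only):

X = np.random.randn(7, 3)                # 7 points, 3 dimensions
label = np.array([0, 2, 0, 1, 2, 2, 1])

c, dist = vcentroids(X, label)
print(c.shape)     # (3, 3): one centroid per distinct label
print(dist.shape)  # (7,): squared distance from each point to its centroid

# Note that dist holds *squared* distances; take np.sqrt(dist) to match the
# torch.linalg.norm in the original loop before applying the hinge.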

Batching requires almost no special handling. For a B x n x d input array batch_X with B x n labels batch_labels, create unique labels for each batch item:

batch_k = batch_labels.max(axis=1) + 1   # number of labels per batch item
base = np.cumsum(batch_k) - batch_k      # exclusive prefix sum: each item's offset
batch_labels += base[:, np.newaxis]

So now every batch element has a unique, contiguous label range. That is, the first batch element's n labels lie in some range [0, k0) where k0 = batch_k[0], the second element's lie in [k0, k0 + k1) where k1 = batch_k[1], and so on.
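
For example, with hypothetical per-item label counts:

batch_k = np.array([2, 3, 2])        # made-up counts for a batch of B = 3
base = np.cumsum(batch_k) - batch_k  # [0, 2, 5]
# the three batch elements then use the disjoint label ranges
# [0, 2), [2, 5), and [5, 7)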

Then simply flatten the B x n x d input to (B*n) x d and call the same vectorized method. Your loss function can be derived from the resulting distances using the same boundary-position reduction technique; a sketch follows.
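
Here is one possible end-to-end sketch of that recipe (my own assembly, not the answerer's exact code). The names batch_X, batch_labels, and delta follow the question's setup; it assumes batch_labels has already been squeezed to shape B x n and that each item's labels are the contiguous integers 0..k_i-1:

def batched_hinge_loss(batch_X, batch_labels, delta):
    """Hypothetical driver: one vcentroids call for the whole batch."""
    B, n, d = batch_X.shape

    # Offset labels so every batch item owns a disjoint, contiguous range.
    batch_k = batch_labels.max(axis=1) + 1
    base = np.cumsum(batch_k) - batch_k
    flat_labels = (batch_labels + base[:, np.newaxis]).reshape(B * n)
    flat_X = batch_X.reshape(B * n, d)

    # One vectorized call over all B*n points.
    c, sq_dist = vcentroids(flat_X, flat_labels)

    # Per-point hinge loss (vcentroids returns squared distances).
    hinge = np.clip(np.sqrt(sq_dist) - delta, 0.0, None) ** 2

    # Per-class mean hinge via the same sort/cumsum/boundary-diff trick.
    ix = np.argsort(flat_labels)
    dl = np.diff(flat_labels[ix], prepend=0)
    jumps = np.flatnonzero(dl)
    pos = np.append(np.insert(np.repeat(jumps, dl[jumps]), 0, 0), B * n)
    counts = np.diff(pos)
    csum = np.concatenate(([0.0], np.cumsum(hinge[ix])))
    class_means = np.diff(csum[pos]) / np.maximum(counts, 1)

    # Classes are laid out consecutively per batch item (batch_k each), so
    # one more cumsum/boundary-diff reduces class means to per-batch losses.
    cls_pos = np.append(base, class_means.shape[0])
    csum2 = np.concatenate(([0.0], np.cumsum(class_means)))
    batch_losses = np.diff(csum2[cls_pos]) / batch_k

    return c, batch_losses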

For a detailed explanation of how the vectorization works, see my blog post.