bei*_*ner 6 python vectorization pytorch
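I am trying to vectorize the following for-loop in PyTorch. I would be happy to vectorize just the inner for-loop, but handling the whole batch would be great too.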
import torch

# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...)    # Tensor of shape [B x N x dim]
batch_labels = torch.tensor(...)  # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = []     # Ultimately should be [B x 1]
batch_centroids = []  # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i]                  # [N x dim]
    labels = batch_labels[i].squeeze(-1)  # [N]
    centroids = []  # Keep track of the means for each class.
    classes = torch.unique(labels)  # Get the unique labels for the classes.
    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.
    total_loss = 0
    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[labels == cl]
        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        centroids.append(centroid)
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)
        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2
        total_loss += hinge.mean()
    # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
    loss = total_loss.mean()
    batch_losses.append(loss)
    batch_centroids.append(centroids)
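I have been racking my brain over how to handle the ragged tensor sizes. The number of classes K_i differs for each item in the batch, and the size of each subset differs as well.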
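It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but the code should translate directly to torch. The key technique is to sort the points by label, take a cumulative sum, and difference that sum at the cluster boundaries.
For a single (non-batched) input consisting of an n x d matrix X and an n-length array label, the following returns the k x d centroids and the n-length array of squared distances from each point to its own centroid: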
import numpy as np

def vcentroids(X, label):
    """
    Vectorized version of centroids.
    """
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]
    # compute pos where pos[i]:pos[i+1] is span of cluster i
    d = np.diff(label, prepend=0)  # jump in label value; nonzero where labels change
    pos = np.flatnonzero(d)  # indices where labels change
    pos = np.repeat(pos, d[pos])  # repeat for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))
    # per-cluster sums: difference the running sum at the span boundaries
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)
    counts = np.diff(pos)
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
    # give every point its centroid, then undo the sort
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[np.argsort(ix)]  # argsort(ix) is the inverse permutation of ix
    dist = np.sum((X - aligned_centroids) ** 2, axis=1)  # squared Euclidean distance
    return c, dist
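Since the question is about PyTorch, here is a rough sketch of what a direct port might look like. The name `vcentroids_torch` is made up for illustration, and the sketch assumes `X` is a floating-point tensor of shape [n, d] and `label` is a 1-D `torch.long` tensor of non-negative labels. I have not benchmarked it, so treat it as a starting point rather than a tuned implementation.

```python
import torch

def vcentroids_torch(X, label):
    """Hypothetical torch port of the numpy vcentroids above."""
    # order points by cluster label
    ix = torch.argsort(label)
    sorted_label = label[ix]
    Xz = X[ix]
    # jump in label value between neighbouring sorted points
    d = torch.diff(sorted_label, prepend=sorted_label.new_zeros(1))
    pos = torch.nonzero(d).flatten()            # indices where labels change
    pos = torch.repeat_interleave(pos, d[pos])  # repeat for 0-length clusters
    pos = torch.cat((pos.new_zeros(1), pos, pos.new_tensor([X.shape[0]])))
    # per-cluster sums via a running sum differenced at the boundaries
    Xz = torch.cat((torch.zeros_like(Xz[:1]), Xz), dim=0)
    Xsums = torch.diff(torch.cumsum(Xz, dim=0)[pos], dim=0)
    counts = torch.diff(pos)
    c = Xsums / counts.clamp(min=1).unsqueeze(1)
    # give every point its centroid, then undo the sort
    repeated = torch.repeat_interleave(c, counts, dim=0)
    aligned = repeated[torch.argsort(ix)]
    dist = ((X - aligned) ** 2).sum(dim=1)      # squared Euclidean distance
    return c, dist

X = torch.tensor([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
label = torch.tensor([1, 1, 0, 3])  # label 2 is deliberately unused
c, dist = vcentroids_torch(X, label)
```

Unused labels (label 2 here) come back as all-zero centroid rows, matching the numpy version's handling of 0-length clusters. One caveat with the running-sum approach: in float32, differencing large cumulative sums can lose precision, so casting to float64 may be worth considering for long inputs.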
Batching requires almost no special handling. For an input B x n x d array batch_X with a B x n array of labels batch_labels, create unique labels for each batch element:
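batch_k = batch_labels.max(axis=1) + 1  # number of labels used by each batch element
batch_k[1:] = batch_k[:-1]              # shift right by one...
batch_k[0] = 0                          # ...so the first element gets offset 0
base = np.cumsum(batch_k)               # label offset of each batch element
batch_labels += base[:, np.newaxis]     # broadcast over the n axis (ndarray has no expand_dims method)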
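So each batch element now has its own unique, contiguous range of labels. That is, the n labels of the first batch element will lie in the range [0, k0), where k0 is the first entry of batch_k before the shift; those of the second element will lie in [k0, k0 + k1), where k1 is the second entry; and so on.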
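To make the offsets concrete, here is a tiny worked example. The arrays are made up for illustration, and unlike the snippet above it builds new arrays (`base`, `unique_labels`) instead of modifying `batch_labels` in place, which is just a stylistic choice.

```python
import numpy as np

# Two batch elements, four points each; labels restart at 0 in each element.
batch_labels = np.array([[0, 1, 1, 2],
                         [0, 0, 1, 1]])

batch_k = batch_labels.max(axis=1) + 1                 # [3, 2]
base = np.concatenate(([0], np.cumsum(batch_k)[:-1]))  # [0, 3]
unique_labels = batch_labels + base[:, np.newaxis]
# unique_labels is [[0, 1, 1, 2],
#                   [3, 3, 4, 4]]

flat_labels = unique_labels.reshape(-1)  # length B*n, ready for vcentroids
```

The first element keeps labels 0..2 and the second is shifted to 3..4, so no label is shared between batch elements once everything is flattened.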
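Then simply flatten the B x n x d input to B*n x d (and the labels to length B*n) and call the same vectorized method. Your loss function can be derived from the resulting distances using the same reduction technique based on the position array.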
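As a sketch of that last step, the following computes the question's per-batch hinge loss (the sum over classes of the mean squared hinge) from the squared distances. The names `segment_means` and `batch_hinge_losses` are hypothetical helpers, not part of the code above. One deliberate simplification: the span boundaries are found with `np.searchsorted` rather than the diff/flatnonzero/repeat construction, which should give the same `pos` array for non-negative integer labels.

```python
import numpy as np

def segment_means(values, label, k):
    """Mean of `values` within each label in [0, k); unused labels give 0."""
    order = np.argsort(label, kind="stable")
    # pos[j]:pos[j+1] is the span of label j in the sorted array
    pos = np.searchsorted(label[order], np.arange(k + 1))
    csum = np.concatenate(([0.0], np.cumsum(values[order])))
    sums = np.diff(csum[pos])
    counts = np.diff(pos)
    return sums / np.maximum(counts, 1)

def batch_hinge_losses(sq_dist, flat_labels, base, delta):
    """sq_dist: squared distances (length B*n); flat_labels: globally unique
    labels (length B*n); base: per-element label offsets (length B)."""
    k_total = int(flat_labels.max()) + 1
    # squared hinge on the Euclidean distance, as in the question
    hinge = np.maximum(np.sqrt(sq_dist) - delta, 0.0) ** 2
    per_class = segment_means(hinge, flat_labels, k_total)
    # map each global class id back to the batch element that owns it
    class_batch = np.searchsorted(base, np.arange(k_total), side="right") - 1
    return np.bincount(class_batch, weights=per_class, minlength=len(base))

sq_dist = np.array([0.0, 4.0, 9.0, 1.0])
flat_labels = np.array([0, 0, 1, 2])
base = np.array([0, 2])  # classes 0..1 belong to element 0, class 2 to element 1
losses = batch_hinge_losses(sq_dist, flat_labels, base, delta=1.0)
```

Here `losses` has one entry per batch element. Unused class ids contribute 0, which matches a loop that only visits the classes actually present.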
For a detailed explanation of how the vectorization works, see my blog post.