RuntimeError during one-hot encoding

Question

I have a dataset whose class values range from -2 to 2 in steps of 1 (i.e., -2, -1, 0, 1, 2), with 9 marking unlabeled data. When I one-hot encode the labels with

self._one_hot_encode(labels)

I get the following error: RuntimeError: index 1 is out of bounds for dimension 1 with size 1

It is raised by this line:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

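The error seems to be triggered by the labels [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1], where 9 is the value my mapping uses for unlabeled data. It is unclear to me how to fix it, even after going through past questions and the answers to similar problems (e.g., "index 1 is out of bounds for dimension 0 with size 1"). The part of the code involved in the error is the following: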

def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels)
    classes = classes[classes != 9] # unlabelled 
    self.n_classes = classes.size(0)

    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == 9)
    labels = labels.clone()  # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0

    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    
    self._one_hot_encode(labels)

    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)

    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        

        prev_predictions = self.predictions
        self._propagate()

Sample of the dataset:

    ID  Target  Weight  Label  Score  Scale_Cat  Scale_num
0   A   D       65.1    1      87     Up         1
1   A   X       35.8    1      87     Up         1
2   B   C       34.7    1      37.5   Down       -2
3   B   P       33.4    1      37.5   Down       -2
4   C   B       33.1    1      37.5   Down       -2
5   S   X       21.4    0      12.5   NA         9

The source code I am using as a reference is at: https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb

Full traceback of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-126-792a234f63dd> in <module>
      4 label_propagation = LabelPropagation(adj_matrix_t)
----> 6 label_propagation.fit(labels_t) # causing error
      7 label_propagation_output_labels = label_propagation.predict_classes()
      8 

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
    100 
    101     def fit(self, labels, max_iter=1000, tol=1e-3):
--> 102         super().fit(labels, max_iter, tol)
    103 
    104 ## Label spreading

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
     58             Convergence tolerance: threshold to consider the system at steady state.
     59         """
---> 60         self._one_hot_encode(labels)
     61 
     62         self.predictions = self.one_hot_labels.clone()

<ipython-input-115-54a7dbc30bd1> in _one_hot_encode(self, labels)
     42         labels[unlabeled_mask] = 0
     43         self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
---> 44         self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
     45         self.one_hot_labels[unlabeled_mask, 0] = 0
     46 

RuntimeError: index 1 is out of bounds for dimension 1 with size 1

Answer 1

小智 2

I went through your notebook (I think you changed 9 to -1 in order to run it) and saw this part of the code:

# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
print("Label Propagation: ", end="")
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()

which eventually calls:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

This is where things go wrong.

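If you take a moment to read the PyTorch documentation for scatter, you will see that it is important to understand four things: dim, index, src and the self matrix. For one-hot encoding, the choice between dim=1 and dim=0 only depends on whether the classes are laid out along columns or rows, and our src is simply the value 1 (more on this later). Right now you are calling scatter along dimension 1, with an index matrix of shape [40, 1] and a result (self) matrix of shape [40, 5].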
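
As a rough illustration of those shapes, here is a minimal sketch. The label values and the sizes (4 observations, 5 classes) are made up for the example and are not taken from the notebook; the labels are assumed to already be valid column positions.

```python
import torch

# Hypothetical labels that are already valid column indices in 0..4.
labels = torch.tensor([3, 0, 4, 1])

index = labels.unsqueeze(1)                       # shape [4, 1]: one class per observation
one_hot = torch.zeros(4, 5).scatter(1, index, 1)  # self has shape [4, 5]; src is the scalar 1

print(index.shape)    # torch.Size([4, 1])
print(one_hot.shape)  # torch.Size([4, 5])
print(one_hot)
```

Each row of the result contains a single 1, in the column named by that row's entry of index.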

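I see two problems here:

1. You are using the literal class values (-2, -1, 0, 1, 2) as the encoding indices in the index matrix. scatter treats each of these values as a column position in the self matrix, and that is where the index goes out of bounds.
2. You mention 6 classes (-2, -1, 0, 1, 2, plus 9 for unlabeled), but you are one-hot encoding only 5 of them. (Yes, I know you want the unlabeled rows to be all zeros, but that is a bit harder to do with scatter. I will explain later.)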
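
A quick way to see the first problem is to check whether every label is a valid column position before calling scatter. The sketch below uses a made-up label tensor shaped like the one in the question (class 1 everywhere, 9 for unlabeled); the variable names are hypothetical.

```python
import torch

# Hypothetical labels resembling the question's data: only class 1, plus the marker 9.
labels = torch.tensor([1, 1, 1, 9, 1])

classes = torch.unique(labels)
classes = classes[classes != 9]
n_classes = classes.size(0)        # number of columns the one-hot matrix will get

masked = labels.clone()
masked[labels == 9] = 0            # same masking step as in _one_hot_encode

# scatter needs every index to satisfy 0 <= index < n_classes
bad = (masked < 0) | (masked >= n_classes)
print(n_classes)              # 1
print(masked[bad].tolist())   # [1, 1, 1, 1]
```

With a single detected class the matrix has only column 0, so the class value 1 cannot be used as a column position.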

So how do we fix this?

Problem 1: let's start with a small example:

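import torch
src = torch.tensor([[1],[1],[1],[1],[1],[0]])  # assumed: src is not shown in the original; chosen to match the output below
index = torch.tensor([[5],[0],[3],[5],[1],[4]]); print(index.shape); print(index)
result = torch.zeros(6, 6, dtype=src.dtype).scatter_(1, index, src); print(result.shape); print(result)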

This gives us:

torch.Size([6, 1])
tensor([[5],
        [0],
        [3],
        [5],
        [1],
        [4]])
torch.Size([6, 6])
tensor([[0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

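Here the index matrix is 6 observations by 1 column (the class of each observation), and the self matrix is 6 observations by 6 classes, holding the one-hot encoded vectors. With scatter(dim=1), torch goes through each row (observation) and, in that row of self, overwrites the column whose position is the value stored in index with the value stored at the same row of src. The documentation states the rule for a 3-D tensor with dim=1 as: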

self[i][index[i][j][k]][k] = src[i][j][k]
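
For the 2-D tensors used here, the rule reduces to self[i][index[i][j]] = src[i][j]. The following sketch checks that reading against an explicit loop; the tensors are made up for illustration.

```python
import torch

index = torch.tensor([[2], [0], [1]])
src = torch.tensor([[10], [20], [30]])

result = torch.zeros(3, 3, dtype=src.dtype).scatter(1, index, src)

# The same update written out by hand for dim=1 on 2-D tensors.
manual = torch.zeros(3, 3, dtype=src.dtype)
for i in range(index.size(0)):
    for j in range(index.size(1)):
        manual[i, index[i, j].item()] = src[i, j]

print(result)
print(torch.equal(result, manual))  # True
```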

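So in your case, self has shape [40, 1]: it has a single column, whose only valid index is 0, and you are trying to write the value 1 into column index 1 of a row. That produces the error in your question. When I ran your notebook, the error was instead "index -1 is out of bounds for dimension 1 with size 5". Both have the same root cause.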
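
The failure can be reproduced in isolation. This is a minimal sketch with made-up sizes (3 observations instead of 40); it only assumes that scatter rejects an index outside the range of the target dimension.

```python
import torch

one_hot = torch.zeros(3, 1)        # one column only, so the only valid index is 0
labels = torch.tensor([1, 0, 1])   # 1 is a class value here, not a column position

try:
    one_hot.scatter(1, labels.unsqueeze(1), 1)
    message = None
except (RuntimeError, IndexError) as err:
    # The question's traceback shows RuntimeError; IndexError is caught as well
    # in case a given PyTorch version reports the problem that way.
    message = str(err)

print(message)
```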

Problem 2: one-hot encoding

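In this case, a full one-hot encoding (one column per class, including the unlabeled one) is easier than a "one-hot-cold" encoding in which unlabeled observations become all-zero rows. With all-zero rows, you need to build a src matrix that contains a 0 for every unlabeled observation, which is more painful than simply using 1 as src. See also the linked discussion "Is an all-zero OHE valid?". I think it makes more sense to one-hot encode every class.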
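
To make the extra work concrete, here is a minimal sketch of the all-zero-row variant. It assumes the labels have already been mapped to column positions 0..4 and that 9 still marks unlabeled rows; the data and variable names are hypothetical.

```python
import torch

# Hypothetical index labels in 0..4; 9 marks an unlabeled observation.
labels = torch.tensor([4, 0, 9, 2])
unlabeled = labels == 9

index = labels.clone()
index[unlabeled] = 0                      # any valid column works, because src writes 0 there
src = (~unlabeled).long().unsqueeze(1)    # 1 for labeled rows, 0 for unlabeled rows

one_hot = torch.zeros(4, 5, dtype=src.dtype).scatter(1, index.unsqueeze(1), src)
print(one_hot)
# tensor([[0, 0, 0, 0, 1],
#         [1, 0, 0, 0, 0],
#         [0, 0, 0, 0, 0],
#         [0, 0, 1, 0, 0]])
```

The src tensor has to be built per observation, whereas the full one-hot variant can pass the scalar 1.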

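So for the second problem, we simply need to map the classes to column indices of the result (self) matrix. Since we have 6 classes, we just map them to 0, 1, 2, 3, 4, 5. A simple lambda function takes care of that. I used a random sampler to draw the data labels from the list of classes (40 observations sampled at random from the 6 classes).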
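
One way to sketch that mapping and sampling is shown below. The names classes, to_index and raw_labels and the seed are hypothetical, and this is not necessarily the code the answer used.

```python
import random
import torch

classes = [-2, -1, 0, 1, 2, 9]             # 9 = unlabeled, kept as its own class
to_index = lambda c: classes.index(c)      # -2 -> 0, -1 -> 1, ..., 2 -> 4, 9 -> 5

random.seed(0)                             # arbitrary seed, only for repeatability
raw_labels = random.choices(classes, k=40) # 40 randomly drawn observations

# Map class values to column positions, then shape the result as [40, 1] for scatter.
index = torch.tensor([to_index(c) for c in raw_labels]).unsqueeze(1)

one_hot = torch.zeros(40, len(classes), dtype=torch.long).scatter(1, index, 1)
print(one_hot.shape)  # torch.Size([40, 6])
print(one_hot)
```

Because every mapped value lies in 0..5, scatter never sees an out-of-range column.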

In the end, we get the one-hot encoding we wanted:

tensor([[0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0],
        ... (40 observations)
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
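
If all-zero rows for unlabeled data are still wanted, as in the question, one alternative that avoids scatter altogether is a broadcast comparison of the labels against the list of classes. This is a different technique from the one described above; the function below is a hypothetical standalone sketch, not the notebook's _one_hot_encode.

```python
import torch

def one_hot_encode(labels, unlabeled_value=9):
    """Hypothetical standalone variant: zero rows for unlabeled observations."""
    classes = torch.unique(labels)                  # sorted unique values
    classes = classes[classes != unlabeled_value]   # e.g. tensor([-2, -1, 0, 1, 2])
    # Row i gets a 1 in column k when labels[i] == classes[k];
    # unlabeled rows match no class and therefore stay all zeros.
    one_hot = (labels.unsqueeze(1) == classes.unsqueeze(0)).float()
    labeled_mask = labels != unlabeled_value
    return one_hot, classes, labeled_mask

labels = torch.tensor([-2, 2, 9, 0, -1, 1])
one_hot, classes, labeled_mask = one_hot_encode(labels)
print(classes)
print(one_hot)
```

Column k of the result corresponds to classes[k], so the class values themselves never need to be valid indices.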
