关于RetinaNet的困惑

Question

关于RetinaNet的困惑

Nic*_*eng 4 object-detection computer-vision deep-learning

我最近一直在学习RetinaNet。我阅读了原始论文和一些相关文章，并写了一篇帖子分享我的经验教训：http : //blog.zenggyu.com/en/post/2018-12-05/retinanet-explained-and-demystified/。但是，我仍然有一些困惑，我也在帖子中指出。谁能启发我？

困惑1

如论文所示，如果锚框的IoU与任何地面真相均低于0.4，则将其分配给背景。在这种情况下，对应的分类目标标签应该是什么（假设有K个类别）？

我知道SSD具有背景类别（总共使K + 1个类别），而YOLO预测置信度得分，该置信度得分指示除K类概率之外，盒子中是否存在物体（不是背景）（不是背景）。虽然我在论文中未找到任何表明RetinaNet包含背景类的陈述，但我确实看到了以下陈述：“ ...，我们仅将检测器的置信度阈值设为0.05后，才从...解码盒预测。”表明存在对置信度得分的预测。但是，此分数从何而来（由于分类子网仅输出表示K个类别的概率的K个数字）？

如果RetinaNet与SSD或YOLO定义目标标签的方式不同，我将假定目标是长度为K的矢量，所有条目均为0，没有1。但是，在这种情况下，如果它是假阴性，那么焦点损失（请参阅下面的定义）将如何惩罚锚点？

哪里

困惑2

与许多其他检测器不同，RetinaNet使用与类无关的包围盒回归器，而分类子网的最后一层的激活是S形激活。这是否意味着一个锚框可以同时预测不同类别的多个对象？

困惑＃3

让我们用$ {（A ^ i，G ^ i）} _ {i = 1，... N} $来表示这些匹配的锚点框和地面真相框对，其中$ A $代表锚点，$ G $代表基本事实，$ N $是匹配数。

对于每个匹配的锚点，回归子网会预测四个数字，我们将其表示为$ P ^ i =（P ^ i_x，P ^ i_y，P ^ i_w，P ^ i_h）$。前两个数字指定锚点$ A ^ i $和地面真值$ G ^ i $的中心之间的偏移，而后两个数字指定锚点的宽度/高度与地面真点之间的偏移。相应地，对于这些预测中的每一个，都有一个回归目标$ T ^ i $计算为锚点与地面真相之间的偏移量：

以上方程式正确吗？

在此先感谢您，并随时在帖子中指出其他误解！

更新：

供将来参考，这是我在学习RetinaNet时遇到的另一种困惑（我发现此对话很轻松）：

Answer 1

小智 7

我是开源视网膜网项目fizyr / keras-retinanet的作者之一。我会尽力回答您的问题。

困惑1

通常，在对象检测器中有两种常用的分类评分方法，您可以使用softmax或使用S型。

If you use softmax, your target values should always be one-hot vectors, meaning if there is no object you should "classify" it as background (meaning you need a background class). The benefit is that your class scores always sum up to one.

If you use sigmoid there are less constraints. This has two benefits in my opinion, you don't need a background class (which makes the implementation cleaner) and it allows the network to do multi-class classification (although it is not supported in our implementation, it is theoretically possible). A small additional benefit is that your network is slightly smaller, since it needs to classify one class less compared to softmax, though this is probably neglible.

In the early days of implementing retinanet we used softmax, because of legacy code from py-faster-rcnn. I contacted the author of the Focal Loss paper and asked him about the softmax/sigmoid situation. His answer was that it was a matter of personal preference and it doesn't matter much if you use one or the other. Because of the mentioned benefits for sigmoid, it is now my personal preference as well.

However, where does this score come from (since the classification subnet only outputs K numbers indicating the probability of K classes)?

Every class score is treated as its own object, but for one anchor they all share the same regression values. If the class score is above that threshold (which I'm pretty sure is arbitrarily chosen), it is considered a candidate object.

If RetinaNet defines target labels differently from SSD or YOLO, I would assume that the target is a length-K vector with all 0s entries and no 1s. However, in this case how does the focal loss (see definition below) will punish an anchor if it is a false negative?

Negatives are classified as a vector containing only zeros. Positives are classified by a one-hot vector. Assuming the prediction is a vector of all zeros but the target was a one-hot vector (in other words, a false negative), then p_t is a list of zeros in your formula. The focal loss will then evaluate to a large value for that anchor.

Confusion #2

Short answer: yes.

Confusion #3

With respect to the original implementation it is almost correct. All values are divided by the width or height. Dividing by A_x, A_y for the values of T_x and T_y is incorrect.

That said, a while back we switched to a slightly simpler implementation where the regression is computed as the difference between the top-left and bottom-right points instead (as a fraction w.r.t. the anchors' width and height). This simplified the implementation a bit since we use top-left / bottom-right throughout the code. In addition I noticed our results slightly increased on COCO.

归档时间：	7 年，7 月前
查看次数：	676 次
最近记录：	7 年，7 月前