Upper Confidence Bounds in Monte Carlo Tree Search when plays or visited are 0

Question

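I'm looking at the Upper Confidence Bound (UCB) calculation as it appears in the Monte Carlo Tree Search (MCTS) algorithm, and I've run into a problem.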

log is the natural log.
C is a weight for exploration over exploitation, for example 1.

simple_score = wins / played
UCB = simple_score + C * sqrt(log(parent's visited) / visited)

The problem arises when played or visited is 0. In that case I still need a single, finite, well-defined value.
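
To make the failure concrete, here is a minimal Python sketch of the formula exactly as written above. The function name `ucb` and its parameter names are illustrative assumptions, not part of any library; the default `c=1.0` follows the example value given for C.

```python
import math

def ucb(wins, played, parent_visited, visited, c=1.0):
    """UCB score as written in the question (hypothetical helper)."""
    # Exploitation term: the observed win rate of this node.
    simple_score = wins / played
    # Exploration term: grows with the parent's visits, shrinks with this node's.
    return simple_score + c * math.sqrt(math.log(parent_visited) / visited)

# A node that has been visited is fine.
print(ucb(wins=3, played=5, parent_visited=20, visited=5))

# A node that has never been played divides by zero.
try:
    ucb(wins=0, played=0, parent_visited=20, visited=0)
except ZeroDivisionError as exc:
    print("undefined for an unvisited node:", exc)
```

Both terms divide by a count that is 0 for a fresh node, so neither the win rate nor the exploration bonus has a value there.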

I'm considering these possibilities for the case where the count is 0.

simple_score = 0
because the node has never won, although it's never lost either

simple_score = 0.5
because the node's value is completely uncertain and 0.5 is halfway

UCB = simple_score + C * sqrt(parent's visited / 1)
UCB = simple_score
UCB = simple_score + C

Does anyone have an answer?

Answer 1

fai*_*dox 5

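The first step of every bandit algorithm, MCTS included, is to pull every arm once. If you did this at every node it would obviously amount to an exhaustive search, so you only use MCTS down to a fixed depth and then switch to a rollout policy. You can of course use priors instead, but then you lose all the theoretical properties of the UCB algorithm, chiefly the logarithmic regret.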
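
A rough sketch of the "pull every arm once" rule, under some assumptions: children are stored as hypothetical `(wins, visited)` pairs, the parent's visit count is taken to be the sum of its children's visits, and `select_child` is an illustrative name rather than an established API. Unvisited children are returned before the formula is ever evaluated, so the division by zero never occurs.

```python
import math

def select_child(children, c=1.0):
    """Return the index of the child to descend into.

    children: list of (wins, visited) pairs (an assumed layout).
    """
    # Pull every arm once: any unvisited child is chosen first.
    for i, (_, visited) in enumerate(children):
        if visited == 0:
            return i

    # Assumption: the parent's visit count equals the sum over its children.
    parent_visited = sum(visited for _, visited in children)

    def score(indexed_child):
        wins, visited = indexed_child[1]
        return wins / visited + c * math.sqrt(math.log(parent_visited) / visited)

    # All children have been visited, so the UCB formula is well defined.
    return max(enumerate(children), key=score)[0]

print(select_child([(3, 5), (0, 0), (1, 2)]))  # the unvisited child, index 1
print(select_child([(5, 5), (0, 1), (1, 2)]))
```

This is equivalent to treating the score of an unvisited child as infinite. It assigns no finite value to such a node, which is consistent with the answer's point that finite defaults act as priors.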
