Mas*_*dam 7 python algorithm numpy
假设我们有一个列表,在每次迭代中附加一个15,32之间的整数(让我们调用整数rand).我想设计一种算法,为每个人分配1(1.25到0.75)之间的奖励rand.分配奖励的规则是这样的.
首先我们计算列表的平均值.然后,如果rand超过平均值,我们预计奖励小于1,如果rand低于平均值,则奖励高于1.平均值之间的距离rand越大,奖励增加/减少的越多.例如:
rand = 15, avg = 23 then reward = 1.25
rand = 32, avg = 23 then reward = 0.75
rand = 23, avg = 23 then reward = 1
等等.
我为此算法开发了以下代码:
import numpy as np
rollouts = np.array([])
i = 0
def modify_reward(lst, rand):
reward = 1
constant1 = 0.25
constant2 = 1
std = np.std(lst)
global avg
avg = np.mean(lst)
sub = np.subtract(avg, rand)
landa = sub / std if std != 0 else 0
coefficient = -1 + ( 2 / (1 + np.exp(-constant2 * landa)))
md_reward = reward + (reward * constant1 * coefficient)
return md_reward
while i < 100:
rand = np.random.randint(15, 33)
rollouts = np.append(rollouts, rand)
modified_reward = modify_reward(rollouts, rand)
i += 1
print([i,rand, avg, modified_reward])
# test the reward for upper bound and lower bound
rand1, rand2 = 15, 32
reward1, reward2 = modify_reward(rollouts, rand1), modify_reward(rollouts, rand2)
print(['reward for upper bound', rand1, avg, reward1])
print(['reward for lower bound', rand2, avg, reward2])
Run Code Online (Sandbox Code Playgroud)
该算法工作得很好,但如果你看下面的例子,你会发现算法的问题.
rand = 15, avg = 23.94 then reward = 1.17 # which has to be 1.25
rand = 32, avg = 23.94 then reward = 0.84 # which has to be 0.75
rand = 15, avg = 27.38 then reward = 1.15 # which has to be 1.25
rand = 32, avg = 27.38 then reward = 0.93 # which has to be 0.75
您可能已经注意到,算法不考虑avg和边界之间的距离(15,32).越avg往移动下界或上界,越modified_reward变得不平衡.
modified_reward无论avg向上界还是下界移动,我都需要统一分配.任何人都可以建议对此算法进行一些修改,可以考虑avg列表之间的距离和边界.
综合这两个要求:
如果
rand超过平均值,我们预计奖励小于1,如果rand小于平均值,则奖励高于1.
modified_reward无论avg向上界还是下界移动,我都需要统一分配.
有点棘手,取决于你的"统一"的意思.
如果你希望15总是得到1.25的奖励,而32总是得到0.75的奖励,你就不能有一个单一的线性关系,同时也尊重你的第一个要求.
如果你很高兴与2个线性关系,你可以瞄准所处的环境modified_reward依赖于rand这样的:
我用这个Wolfram Alpha查询生成的.正如您所看到的,这是两个线性关系,其中"膝盖"位于avg.我希望你能够毫不费力地得出每个部分的公式.
此代码实现了与从平均值到给定限制的距离成比例的线性权重分布.
import numpy as np
class Rewarder(object):
lo = 15
hi = 32
weight = 0.25
def __init__(self):
self.lst = np.array([])
def append(self, x):
self.lst = np.append(self.lst, [x])
def average(self):
return np.mean(self.lst)
def distribution(self, a, x, b):
'''
Return a number between 0 and 1 proportional to
the distance of x from a towards b.
Note: Modify this fraction if you want a normal distribution
or quadratic etc.
'''
return (x - a) / (b - a)
def reward(self, x):
avg = self.average()
if x > avg :
w = self.distribution(avg, x, self.hi)
else:
w = - self.distribution(avg, x, self.lo)
return 1 - self.weight * w
rollouts = Rewarder()
rollouts.append(23)
print rollouts.reward(15)
print rollouts.reward(32)
print rollouts.reward(23)
Run Code Online (Sandbox Code Playgroud)
生产:
1.25
0.75
1.0
Run Code Online (Sandbox Code Playgroud)
您的问题中的代码似乎正在使用np.std我认为是尝试获得正常分布.请记住,正态分布实际上从未实现为零.
如果您告诉我您希望分发的形状,我们可以修改Rewarder.distribution以适应.
编辑:
我无法访问你所提到的论文,但推断你想要一个sigmoid风格的奖励分配给出平均值为0,最小值和最大值为+/- 0.25.使用误差函数作为加权,如果我们按2进行加权,我们在最小值和最大值时得到约0.995.
覆盖Rewarder.distribution:
import math
class RewarderERF(Rewarder):
def distribution(self, a, x, b):
"""
Return an Error Function (sigmoid) weigthing of the distance from a.
Note: scaled to reduce error at max to ~0.003
ref: https://en.wikipedia.org/wiki/Sigmoid_function
"""
return math.erf(2.0 * super(RewarderERF, self).distribution(a, x, b))
rollouts = RewarderERF()
rollouts.append(23)
print rollouts.reward(15)
print rollouts.reward(32)
print rollouts.reward(23)
Run Code Online (Sandbox Code Playgroud)
结果是:
1.24878131454
0.75121868546
1.0
Run Code Online (Sandbox Code Playgroud)
您可以选择适合您的应用程序的错误函数以及您可以在最小值和最大值时接受的错误数量.我也希望你将所有这些功能集成到你的课堂中,我把所有功能都分开了,这样我们就可以看到这些部分了.
关于计算均值,您是否需要保留值列表并每次重新计算,还是可以保持计数和总计的总和?那么你不需要numpy来进行这个计算.
我不明白你为什么要这样计算 md_reward 。请给出逻辑和理由。但
\n\nlanda = sub / std if std != 0 else 0\ncoefficient = -1 + ( 2 / (1 + np.exp(-constant2 * landa)))\nmd_reward = reward + (reward * constant1 * coefficient)\nRun Code Online (Sandbox Code Playgroud)\n\n不会提供您正在寻找的东西。因为让我们考虑以下情况
\n\nfor md_reward to be .75 \n--> coefficient should be -1\n --> landa == -infinite (negative large value, i.e. , rand should be much larger than 32)\n\nfor md_reward to be 1\n--> coefficient should be 0\n --> landa == 0 (std == 0 or sub == 0) # which is possible\n\nfor md_reward to be 1.25 \n--> coefficient should be 1\n --> landa == infinite (positive large value, i.e. , rand should be much smaller than 15)\nRun Code Online (Sandbox Code Playgroud)\n\n如果您想将奖励从平均值标准化为最大值,将平均值标准化为最小值。检查下面的链接。\n https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range \n https://stats.stackexchange.com/questions/70553 /归一化意味着什么以及如何验证样本或分布
\n\n现在使用以下内容修改您的函数。
\n\ndef modify_reward(lst, rand):\n reward = 1\n constant1 = 0.25\n min_value = 15\n max_value = 32\n avg = np.mean(lst)\n if rand >= avg:\n md_reward = reward - constant1*(rand - avg)/(max_value - avg) # normalize rand from avg to max\n else:\n md_reward = reward + constant1*(1 - (rand - min_value)/(avg - min_value)) # normalize rand from min to avg\n return md_reward\nRun Code Online (Sandbox Code Playgroud)\n\n我用过下面的方法
\n\nNormalized:\n(X\xe2\x88\x92min(X))/(max(X)\xe2\x88\x92min(X))\nRun Code Online (Sandbox Code Playgroud)\n\n对于案例rand >= avg
min(X) 为平均值,max(X) 为 max_value
\n\n对于案例rand < avg
min_value 中的 min(X) 和 max(X) 是平均值
\n\n希望这可以帮助。
\n| 归档时间: |
|
| 查看次数: |
377 次 |
| 最近记录: |