Yan*_*aud 0 random math statistics probability modulo
前言
这个问题与(P)RNG和rand()
.的行为无关.它是关于使用以模数均匀分布的两个值的幂.
介绍
我知道不应该使用modulo %
将值从一个范围转换为另一个范围,例如从rand()
函数中得到0到5之间的值:会有偏差.它在这里解释https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default并在这个答案中为什么人们说使用随机数发生器时存在模偏差?
但是今天在调查了一些看起来错误的代码之后,我已经制作了一个工具来演示模数的行为:https://gitorious.org/modulo-test/modulo-test/trees/master 并发现它不够清楚.
骰子只有3位
我检查了范围0..5中的6个值.编码这些值只需要3位.
$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
[0..7] => [0..5]
theorical occurences 1666.67 probability 0.16666667
[ 0] occurences 2446 probability 0.24460000 ( +46.76%)
[ 1] occurences 2535 probability 0.25350000 ( +52.10%)
[ 2] occurences 1275 probability 0.12750000 ( -23.50%)
[ 3] occurences 1297 probability 0.12970000 ( -22.18%)
[ 4] occurences 1216 probability 0.12160000 ( -27.04%)
[ 5] occurences 1231 probability 0.12310000 ( -26.14%)
minimum occurences 1216.00 probability 0.12160000 ( -27.04%)
maximum occurences 2535.00 probability 0.25350000 ( +52.10%)
mean occurences 1666.67 probability 0.16666667 ( +0.00%)
stddev occurences 639.43 probability 0.06394256 ( 38.37%)
Run Code Online (Sandbox Code Playgroud)
使用3位输入,结果确实很糟糕,但表现得如预期.请参阅答案/sf/answers/1023042961/
增加输入位数
令我困惑的是,增加输入位的数量使得结果不同.你不应该忘记增加迭代次数,例如样本数,否则结果可能是错误的(参见错误的统计数据).
让我们试试4位:
$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
[0..15] => [0..5]
theorical occurences 3333.33 probability 0.16666667
[ 0] occurences 3728 probability 0.18640000 ( +11.84%)
[ 1] occurences 3763 probability 0.18815000 ( +12.89%)
[ 2] occurences 3675 probability 0.18375000 ( +10.25%)
[ 3] occurences 3721 probability 0.18605000 ( +11.63%)
[ 4] occurences 2573 probability 0.12865000 ( -22.81%)
[ 5] occurences 2540 probability 0.12700000 ( -23.80%)
minimum occurences 2540.00 probability 0.12700000 ( -23.80%)
maximum occurences 3763.00 probability 0.18815000 ( +12.89%)
mean occurences 3333.33 probability 0.16666667 ( +0.00%)
stddev occurences 602.48 probability 0.03012376 ( 18.07%)
Run Code Online (Sandbox Code Playgroud)
让我们试试5位:
$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
[0..31] => [0..5]
theorical occurences 6666.67 probability 0.16666667
[ 0] occurences 7462 probability 0.18655000 ( +11.93%)
[ 1] occurences 7444 probability 0.18610000 ( +11.66%)
[ 2] occurences 6318 probability 0.15795000 ( -5.23%)
[ 3] occurences 6265 probability 0.15662500 ( -6.03%)
[ 4] occurences 6334 probability 0.15835000 ( -4.99%)
[ 5] occurences 6177 probability 0.15442500 ( -7.34%)
minimum occurences 6177.00 probability 0.15442500 ( -7.34%)
maximum occurences 7462.00 probability 0.18655000 ( +11.93%)
mean occurences 6666.67 probability 0.16666667 ( +0.00%)
stddev occurences 611.58 probability 0.01528949 ( 9.17%)
Run Code Online (Sandbox Code Playgroud)
让我们试试6位:
$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
[0..63] => [0..5]
theorical occurences 13333.33 probability 0.16666667
[ 0] occurences 13741 probability 0.17176250 ( +3.06%)
[ 1] occurences 13610 probability 0.17012500 ( +2.08%)
[ 2] occurences 13890 probability 0.17362500 ( +4.18%)
[ 3] occurences 13702 probability 0.17127500 ( +2.77%)
[ 4] occurences 12492 probability 0.15615000 ( -6.31%)
[ 5] occurences 12565 probability 0.15706250 ( -5.76%)
minimum occurences 12492.00 probability 0.15615000 ( -6.31%)
maximum occurences 13890.00 probability 0.17362500 ( +4.18%)
mean occurences 13333.33 probability 0.16666667 ( +0.00%)
stddev occurences 630.35 probability 0.00787938 ( 4.73%)
Run Code Online (Sandbox Code Playgroud)
题
请解释一下为什么在更改输入位时结果不同(并相应地增加样本数)?这些背后的数学推理是什么?
错误的统计数据
在问题的前一版本中,我展示了一个32位输入和仅1000000次迭代的测试,例如10 ^ 6个样本,并且说我很惊讶得到正确的结果.这是错误的我感到羞耻:必须有N倍的样本才能有信心获得发生器的所有2 ^ 32值.这里10 ^ 6是小到2 ^ 32的方式. 能够用数学/统计语言解释这一点的人的奖金..
这里结果不对:
$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
[0..4294967295] => [0..5]
theorical occurences 166666.67 probability 0.16666667
[ 0] occurences 166881 probability 0.16688100 ( +0.13%)
[ 1] occurences 166881 probability 0.16688100 ( +0.13%)
[ 2] occurences 166487 probability 0.16648700 ( -0.11%)
[ 3] occurences 166484 probability 0.16648400 ( -0.11%)
[ 4] occurences 166750 probability 0.16675000 ( +0.05%)
[ 5] occurences 166517 probability 0.16651700 ( -0.09%)
minimum occurences 166484.00 probability 0.16648400 ( -0.11%)
maximum occurences 166881.00 probability 0.16688100 ( +0.13%)
mean occurences 166666.67 probability 0.16666667 ( +0.00%)
stddev occurences 193.32 probability 0.00019332 ( 0.12%)
Run Code Online (Sandbox Code Playgroud)
我仍然需要阅读并重新阅读Zed Shaw 的优秀文章"程序员需要学习统计或我会杀死他们".
从本质上讲,你正在做:
(rand() & 7) % 6
Run Code Online (Sandbox Code Playgroud)
让我们假设它rand()
是均匀分布的[0; RAND_MAX]
,这RAND_MAX+1
是2的幂.很显然,rand() & 7
可以评估到0
,1
,... 7
,那结果是等概率.
现在让我们看一下当您采用模数结果时会发生什么6
.
这就解释了为什么你获得其他数字的两倍的零和一.
在第二种情况下也发生了同样的事情.然而,"额外"数字的值要小得多,使得它们的贡献与噪声无法区分.
总而言之,如果你有一个均匀分布在[ 0
; M-1
],你以模为模N
,结果将偏向零,除非可M
被整除N
.
归档时间: |
|
查看次数: |
382 次 |
最近记录: |