Why am I getting a poor random distribution from Perl's List::Util::shuffle?

bla*_*ght 0 arrays perl shuffle

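I have a collection of several hundred vinyl records, organized alphanumerically by catalog ID string. I wrote a script that randomly selects 20 records from my collection by sampling a shuffled array of catalog IDs. However, I find that the records it picks are often poorly distributed. It commonly chooses 2 records with sequential catalog IDs, and/or several groups of records that sit close to each other. I would expect this to happen only rarely when choosing 20 records out of 800.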

I store the list of catalog IDs in the array @selection. To draw a random sample of 20 items from it, I take the first 20 items of the shuffled array:

@selection = (shuffle @selection)[0 .. 19];

In desperation, I tried this ugly technique to improve the randomness, but it doesn't seem to make any difference:

@selection = shuffle @selection; sleep 1;
@selection = reverse @selection; sleep 1;
@selection = (shuffle @selection)[0 .. 19];

ike*_*ami 5

There are C(800, 20) = 3.73 × 10^39 ways of choosing 20 titles from 800.

There are C(781, 20) = 2.29 × 10^39 ways of choosing 20 titles from 800 where no two are adjacent.[1]
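For readers wondering where 781 comes from, here is a sketch of the standard counting argument, assuming the titles are numbered 1 to n in shelf order and a_1 < ... < a_k are the chosen positions (these symbols are introduced here for illustration and are not part of the original answer). Shifting each pick left by the number of picks before it removes the mandatory gaps, which turns a non-adjacent selection into an ordinary selection from n - k + 1 slots:

```latex
\[
  b_i = a_i - (i - 1), \qquad
  a_{i+1} - a_i \ge 2 \iff b_{i+1} - b_i \ge 1,
  \qquad 1 \le b_1 < b_2 < \dots < b_k \le n - k + 1 .
\]
\[
  \#\{\text{non-adjacent } k\text{-subsets of } \{1, \dots, n\}\}
  = \binom{n - k + 1}{k},
  \qquad
  \binom{800 - 20 + 1}{20} = \binom{781}{20} .
\]
```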

There is therefore a (2.29 × 10^39) / (3.73 × 10^39) = 61.4% chance of picking a set that contains no adjacent titles.

There is therefore a 1 - 61.4% = 38.6% chance of picking a set that contains adjacent titles.
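The same ratio can be checked without handling 40-digit numbers, because C(n-k+1, k) / C(n, k) collapses into a product of k small fractions. The following is a minimal sketch, not part of the original answer; the helper name prob_no_adjacent is hypothetical. Since it skips the intermediate rounding to three significant figures, its result may differ from the figures above by about a tenth of a percentage point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Probability that k items drawn uniformly without replacement from n
# items in a row contain no two neighbours: C(n-k+1, k) / C(n, k).
# (Hypothetical helper, introduced for illustration only.)
sub prob_no_adjacent {
    my ($n, $k) = @_;
    return 0 if $k > $n - $k + 1;    # too many picks to avoid neighbours
    my $p = 1;
    # C(n-k+1, k) / C(n, k) = product over i = 0..k-1 of (n-k+1-i) / (n-i)
    $p *= ($n - $k + 1 - $_) / ($n - $_) for 0 .. $k - 1;
    return $p;
}

my $p = prob_no_adjacent(800, 20);
printf "no adjacent pair:           %.1f%%\n", $p * 100;
printf "at least one adjacent pair: %.1f%%\n", (1 - $p) * 100;
```

Working with the product of ratios keeps every intermediate value between 0 and 1, so ordinary floating point is enough here and no big-integer module is needed.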

Now that we know what to expect, let's put shuffle to the test.

Test:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw( shuffle );

my $num_tests = 100_000;
my $N = 800;
my @titles = 0..($N-1);   # Stand-ins for the sorted catalog IDs.
my $has_adjacent_titles = 0;
for (1..$num_tests) {
   # Sample 20 titles the same way the question does.
   my @shuffled_selection = ( shuffle(@titles) )[0..19];
   my @ordered = sort { $a <=> $b } @shuffled_selection;
   # Count this trial if any two picks are consecutive.
   ++$has_adjacent_titles if grep { $ordered[$_-1]+1 == $ordered[$_] } 1..$#ordered;
}

printf "%.1f%%\n", $has_adjacent_titles / $num_tests * 100;

Output:

>a.pl
38.6%

>a.pl
38.8%

>a.pl
38.5%

Seems like shuffle is working quite well.


  1. See Combinatorial restriction on choosing adjacent objects,