Jay*_*ens 9 c# statistics histogram
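To compute a histogram, I need to generate the bins. The language is C#. Basically, I need to take an array of decimal values and produce a histogram from it.
I couldn't find a decent library that does this directly, so for now I'm just looking for a library or an algorithm to help me bin the data.
So...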
Jak*_*son 15
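Here is a simple bucket function that I use. Unfortunately, .NET generics do not support numeric type constraints, so you have to implement a separate version of the following function for decimal, int, double, and so on.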
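    public static List<int> Bucketize(this IEnumerable<decimal> source, int totalBuckets)
    {
        var min = source.Min();
        var max = source.Max();
        // Pre-fill the list with zeros so that every bucket index can be incremented.
        var buckets = new List<int>(new int[totalBuckets]);
        var bucketSize = (max - min) / totalBuckets;
        foreach (var value in source)
        {
            int bucketIndex = 0;
            // bucketSize is 0 when all values are equal; everything then lands in bucket 0.
            if (bucketSize > 0m)
            {
                bucketIndex = (int)((value - min) / bucketSize);
                // The maximum value computes index totalBuckets; fold it into the last bucket.
                if (bucketIndex == totalBuckets)
                {
                    bucketIndex--;
                }
            }
            buckets[bucketIndex]++;
        }
        return buckets;
    }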
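One way to avoid writing a copy per numeric type is to let the caller supply the conversion to double. The following is a minimal sketch and not part of the original answer: the class name `HistogramExtensions`, the method name `BucketizeBy`, and the `toDouble` parameter are hypothetical, and converting decimal to double can lose precision for values with many significant digits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramExtensions
{
    // Hypothetical helper: buckets any element type through a caller-supplied
    // conversion to double.
    public static int[] BucketizeBy<T>(
        this IEnumerable<T> source, int totalBuckets, Func<T, double> toDouble)
    {
        if (totalBuckets <= 0)
            throw new ArgumentOutOfRangeException(nameof(totalBuckets));

        // Convert once so that the source is enumerated a single time.
        double[] values = source.Select(toDouble).ToArray();
        var buckets = new int[totalBuckets];
        if (values.Length == 0)
            return buckets;

        double min = values.Min();
        double max = values.Max();
        double bucketSize = (max - min) / totalBuckets;

        foreach (double value in values)
        {
            int bucketIndex = 0;
            if (bucketSize > 0.0)
            {
                bucketIndex = (int)((value - min) / bucketSize);
                // Same rule as the function above: the maximum is folded
                // into the last bucket.
                if (bucketIndex >= totalBuckets)
                    bucketIndex = totalBuckets - 1;
            }
            buckets[bucketIndex]++;
        }
        return buckets;
    }
}
```

For a decimal sequence this might be called as `prices.BucketizeBy(10, d => (double)d)`, where `prices` is whatever collection holds the data.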
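I got strange results using the accepted answer from @JakePearson. It has to do with an edge case.
Here is the code I used to test his method. I changed the extension method slightly so that it returns int[] and accepts double instead of decimal.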
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            Random rand = new Random(1325165);
            int maxValue = 100;
            int numberOfBuckets = 100;
            List<double> values = new List<double>();
            for (int i = 0; i < 10000000; i++)
            {
                double value = rand.NextDouble() * (maxValue+1);
                values.Add(value);
            }
            int[] bins = values.Bucketize(numberOfBuckets);
            PointPairList points = new PointPairList();
            for (int i = 0; i < numberOfBuckets; i++)
            {
                points.Add(i, bins[i]);
            }
            zedGraphControl1.GraphPane.AddBar("Random Points", points,Color.Black);
            zedGraphControl1.GraphPane.YAxis.Title.Text = "Count";
            zedGraphControl1.GraphPane.XAxis.Title.Text = "Value";
            zedGraphControl1.AxisChange();
            zedGraphControl1.Refresh();
        }
    }
    public static class Extension
    {
        public static int[] Bucketize(this IEnumerable<double> source, int totalBuckets)
        {
            var min = source.Min();
            var max = source.Max();
            var buckets = new int[totalBuckets];
            var bucketSize = (max - min) / totalBuckets;
            foreach (var value in source)
            {
                int bucketIndex = 0;
                if (bucketSize > 0.0)
                {
                    bucketIndex = (int)((value - min) / bucketSize);
                    if (bucketIndex == totalBuckets)
                    {
                        bucketIndex--;
                    }
                }
                buckets[bucketIndex]++;
            }
            return buckets;
        }
    }
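With 10,000,000 random double values between 0 and 100 (exclusive), everything works fine. Each bucket holds roughly the same number of values, which makes sense given that Random returns a uniform distribution.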

But when I changed the value-generation line from
    double value = rand.NextDouble() * (maxValue+1);
to
    double value = rand.Next(0, maxValue + 1);
you get the following result, in which the last bucket is double-counted.

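It looks like when a value is exactly equal to one of a bucket's boundaries, the code as written puts that value into the wrong bucket. Because the chance of a random double landing exactly on a bucket boundary is tiny and goes unnoticed, the artifact does not seem to occur with random double values.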
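The effect can be reproduced without a chart. The sketch below copies the double version of Bucketize from the test code into a hypothetical class `BoundaryDemo` and feeds it the integers 0 through 100, which is every value that `rand.Next(0, maxValue + 1)` can produce. Since min is 0 and max is 100, the bucket width is exactly 1, so the value 100 computes index 100 and is folded back into bucket 99, which already holds the value 99.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BoundaryDemo
{
    // Copy of the double-based Bucketize from the test code.
    public static int[] Bucketize(IEnumerable<double> source, int totalBuckets)
    {
        var min = source.Min();
        var max = source.Max();
        var buckets = new int[totalBuckets];
        var bucketSize = (max - min) / totalBuckets;
        foreach (var value in source)
        {
            int bucketIndex = 0;
            if (bucketSize > 0.0)
            {
                bucketIndex = (int)((value - min) / bucketSize);
                if (bucketIndex == totalBuckets)
                {
                    bucketIndex--; // the maximum joins the last bucket
                }
            }
            buckets[bucketIndex]++;
        }
        return buckets;
    }

    // Hypothetical helper: one of each integer 0..100, sorted into 100 buckets.
    public static int[] CountIntegersZeroToHundred()
    {
        List<double> values = Enumerable.Range(0, 101).Select(i => (double)i).ToList();
        return Bucketize(values, 100);
    }
}
```

Under these assumptions every bucket holds one value except the last, which holds two (99 and 100). With uniformly drawn integers, that is why the last bar comes out about twice as tall as the others.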
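The way I corrected this was to define which side of a bucket boundary is inclusive and which is exclusive.
Consider
0< x <=1 1< x <=2 ... 99< x <=100
versus
0<= x <1 1<= x <2 ... 99<= x <100
You cannot make both boundaries inclusive, because if a value is exactly equal to a boundary, the method would not know which bucket to put it in.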
    public enum BucketizeDirectionEnum
    {
        LowerBoundInclusive,
        UpperBoundInclusive
    }
    public static int[] Bucketize(this IList<double> source, int totalBuckets, BucketizeDirectionEnum inclusivity = BucketizeDirectionEnum.UpperBoundInclusive)
    {
        var min = source.Min();
        var max = source.Max();
        var buckets = new int[totalBuckets];
        var bucketSize = (max - min) / totalBuckets;
        if (inclusivity == BucketizeDirectionEnum.LowerBoundInclusive)
        {
            // Buckets are [lower, upper): a value equal to max has no bucket and is skipped.
            foreach (var value in source)
            {
                int bucketIndex = (int)((value - min) / bucketSize);
                if (bucketIndex == totalBuckets)
                    continue;
                buckets[bucketIndex]++;
            }
        }
        else
        {
            // Buckets are (lower, upper]: a value equal to min has no bucket and is skipped.
            foreach (var value in source)
            {
                int bucketIndex = (int)Math.Ceiling((value - min) / bucketSize) - 1;
                if (bucketIndex < 0)
                    continue;
                buckets[bucketIndex]++;
            }
        }
        return buckets;
    }
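The only remaining problem is that if the input data set contains a large number of values equal to the minimum or the maximum, the binning method excludes many of them, and the resulting chart misrepresents the data set.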
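One possible way around this, sketched here as an assumption and not as part of the original answer, is to let the caller pass the histogram range explicitly instead of deriving it from the data, so that neither extreme has to be dropped. The class name `RangedHistogram` and the `lower` and `upper` parameters are hypothetical. Buckets are lower-bound inclusive, a value exactly equal to `upper` is placed in the last bucket, and values outside the range (or NaN) are skipped.

```csharp
using System;
using System.Collections.Generic;

public static class RangedHistogram
{
    // Hypothetical variant: the caller chooses the range [lower, upper].
    public static int[] Bucketize(
        this IEnumerable<double> source, int totalBuckets, double lower, double upper)
    {
        if (totalBuckets <= 0)
            throw new ArgumentOutOfRangeException(nameof(totalBuckets));
        if (!(upper > lower))
            throw new ArgumentException("upper must be greater than lower.");

        var buckets = new int[totalBuckets];
        double bucketSize = (upper - lower) / totalBuckets;

        foreach (double value in source)
        {
            // Ignore anything the caller's range does not cover.
            if (double.IsNaN(value) || value < lower || value > upper)
                continue;

            int bucketIndex = (int)((value - lower) / bucketSize);
            // value == upper (or rounding just below it) can compute
            // index totalBuckets; keep it in the last bucket.
            if (bucketIndex >= totalBuckets)
                bucketIndex = totalBuckets - 1;
            buckets[bucketIndex]++;
        }
        return buckets;
    }
}
```

For integer data from 0 through 100, calling this with `lower` 0, `upper` 101 and 101 buckets gives each integer its own bucket of width 1, so the minimum and the maximum are both counted and no bucket absorbs two distinct values.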