Che*_*erL 5 python algorithm statistics
问题是:我有一组数字,我需要将其分为 k 个子集。我必须找到最佳的划分策略来获得每个子集的方差之和最小。子集不能为空(方差是标准差的平方。)
k是大于0的整数。近似可以是1e+7
到目前为止,这是我的解决方案,它适用于某些示例,但并不总是适用:
Python代码:
class MinimumVariancePartition(object):
def minDev(self, mixedSamples, k):
# mixedSamples is a tuple, k is an integer.
samples_ascending = sorted(mixedSamples)
# Build a list of lists contains indices and distances.
idx_dist = []
for index in range(len(samples_ascending) - 1):
starting_idx = index
dist = abs(samples_ascending[index] - samples_ascending[index + 1])
idx_dist.append([starting_idx, dist])
sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)
# Get a list of indices to split the sample.
split_idx = []
for i in range(k - 1):
split_idx.append(sorted_idx_dist[i][0])
# Get a list of subsets.
partitions = []
if len(split_idx) == 0:
partitions.append(mixedSamples)
else:
split_idx = sorted(split_idx)
prev = 0
for idx in split_idx:
partitions.append(samples_ascending[prev:idx + 1])
prev = idx + 1
partitions.append(samples_ascending[prev:])
# Compute the sum of variances
result = 0
for partition in partitions:
variance = self.variance(partition)
result += variance
return result
def variance(self, partition):
# Compute variance of a subset
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14)**2 # use round() to avoid float number 'trick'
variance += temp
variance /= size
return variance
Run Code Online (Sandbox Code Playgroud)
通过测试:
input: (3, 4, 7, 10), 1
output: 7.5
input: (1000,500,1,500), 3
output: 0.0
input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672
Run Code Online (Sandbox Code Playgroud)
测试失败:
input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889
input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667
input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875
Run Code Online (Sandbox Code Playgroud)
我认为我的解决方案中的问题可能出在查找分区索引步骤中,但仍然不知道为什么它不起作用。
它不起作用,因为您的算法的想法不正确(仅考虑两个相邻元素之间的距离的分割并不总是产生最佳解决方案)。
您可以改用动态编程:
1. 对数组进行排序。
2.f(first_free, sets_count)如果该first_free元素是尚未添加到任何集合中的第一个元素并且sets_count已经创建了确切的集合,那么我们假设 是最小方差和。
3. 基本情况是f(0, 0) = 0。它对应于一个空前缀。
4. 过渡看起来是这样的:
for first_free = 0 ... n - 1:
for new_first_free = first_free + 1 ... n:
for sets_count = 0 ... k - 1:
f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])
Run Code Online (Sandbox Code Playgroud)
f(n, k)(其中n是集合中元素的数量)。这是我的实现(它可以优化,这只是一个草图,但它可以正常工作):
a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10
def get_variance(partition):
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14) ** 2
variance += temp
variance /= size
return variance
def calc(pos, cnt):
if (pos, cnt) in mem.keys():
return mem[(pos, cnt)]
if pos == 0 and cnt >= 0:
return 0.0
if cnt < 0:
return INF
res = INF
for old_pos in range(0, pos):
res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
mem[(pos, cnt)] = res
return res
if __name__ == '__main__':
a.sort()
print(calc(len(a), k))
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
3081 次 |
| 最近记录: |