划分集合以获得子集方差最小和的策略

Che*_*erL 5 python algorithm statistics

问题是:我有一组数字,我需要将其分为 k 个子集。我必须找到最佳的划分策略来获得每个子集的方差之和最小。子集不能为空(方差是标准差的平方。)

k是大于0的整数。近似可以是1e+7

到目前为止,这是我的解决方案,它适用于某些示例,但并不总是适用:

  1. 按升序对样本(一组数字)进行排序。
  2. 计算两个连续元素的距离。构造一个列表列表,子列表有左元素的索引和距离(即[[idx, dist], [idx, dist]......])。按距离降序对列表进行排序。
  3. 使用我拥有的列表中的索引,从左到右获取索引来划分升序排序的样本

Python代码:

class MinimumVariancePartition(object):
    def minDev(self, mixedSamples, k):
        # mixedSamples is a tuple, k is an integer.

        samples_ascending = sorted(mixedSamples)

        # Build a list of lists contains indices and distances.
        idx_dist = []
        for index in range(len(samples_ascending) - 1):
            starting_idx = index
            dist = abs(samples_ascending[index] - samples_ascending[index + 1])
            idx_dist.append([starting_idx, dist])

        sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)

        # Get a list of indices to split the sample.
        split_idx = []
        for i in range(k - 1):
            split_idx.append(sorted_idx_dist[i][0])

        # Get a list of subsets.    
        partitions = []
        if len(split_idx) == 0:
            partitions.append(mixedSamples)
        else:
            split_idx = sorted(split_idx)
            prev = 0
            for idx in split_idx:
                partitions.append(samples_ascending[prev:idx + 1])
                prev = idx + 1
            partitions.append(samples_ascending[prev:])

        # Compute the sum of variances
        result = 0
        for partition in partitions:
            variance = self.variance(partition)
            result += variance
        return result

    def variance(self, partition):
        # Compute variance of a subset
        size = len(partition)
        s = sum(partition)
        mean = float(s) / size
        variance = 0
        for n in partition:
            temp = round(n - mean, 14)**2  # use round() to avoid float number 'trick'
            variance += temp
        variance /= size
        return variance
Run Code Online (Sandbox Code Playgroud)

通过测试:

input: (3, 4, 7, 10), 1
output: 7.5

input: (1000,500,1,500), 3
output: 0.0

input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672
Run Code Online (Sandbox Code Playgroud)

测试失败:

input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889

input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667

input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875
Run Code Online (Sandbox Code Playgroud)

我认为我的解决方案中的问题可能出在查找分区索引步骤中,但仍然不知道为什么它不起作用。

kra*_*ich 5

它不起作用,因为您的算法的想法不正确(仅考虑两个相邻元素之间的距离的分割并不总是产生最佳解决方案)。

您可以改用动态编程:
1. 对数组进行排序。
2.f(first_free, sets_count)如果该first_free元素是尚未添加到任何集合中的第一个元素并且sets_count已经创建了确切的集合,那么我们假设 是最小方差和。
3. 基本情况是f(0, 0) = 0。它对应于一个空前缀。
4. 过渡看起来是这样的:

for first_free = 0 ... n - 1:
    for new_first_free = first_free + 1 ... n:
        for sets_count = 0 ... k - 1:
            f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
                f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])
Run Code Online (Sandbox Code Playgroud)
  1. 答案f(n, k)(其中n是集合中元素的数量)。

这是我的实现(它可以优化,这只是一个草图,但它可以正常工作):

a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10


def get_variance(partition):
    size = len(partition)
    s = sum(partition)
    mean = float(s) / size
    variance = 0
    for n in partition:
        temp = round(n - mean, 14) ** 2
        variance += temp
    variance /= size
    return variance


def calc(pos, cnt):
    if (pos, cnt) in mem.keys():
        return mem[(pos, cnt)]
    if pos == 0 and cnt >= 0:
        return 0.0
    if cnt < 0:
        return INF
    res = INF
    for old_pos in range(0, pos):
        res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
    mem[(pos, cnt)] = res
    return res


if __name__ == '__main__':
    a.sort()
    print(calc(len(a), k))
Run Code Online (Sandbox Code Playgroud)

  • @ChesterL 仅当生成的分区始终由连续的子数组组成时,动态编程才能正确工作。当且仅当数组已排序时才为真。 (2认同)