Maintaining a ratio when splitting data in a Python function

Question

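I have some data that I want to split into smaller groups that all keep a common ratio. I wrote a function that takes two arrays, computes their size ratio, and then lists the options for how many groups I can split them into (if all groups are the same size). Here is the function: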

def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else: 
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " +  str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns= len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given"

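So my question is: how can I give the function a third input, the number of folds, and change it so that, rather than iterating to make sure every group has the same size and the right ratio, it produces groups that have the right ratio but may vary in size?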

Addition for @JamesHolderness

So your method is almost perfect, but there is one problem:

With lengths of 357 and 143 and 9 folds, this is the returned list:

[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]

Now, when you add up the columns, you get this: 351 144

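The 351 is fine because it is less than 357, but the 144 doesn't work because it is greater than 143! The reason is that 357 and 143 are the lengths of the arrays, so the 144th row of that array doesn't exist...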
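The column sums can be checked with a few lines. This is only a sketch of the arithmetic described above; the names `groups`, `available`, `used`, and `overrun` are illustrative and do not come from the original code.

```python
# Fold sizes reported above, as (first_size, second_size) pairs.
groups = [(39, 16)] * 9
available = (357, 143)  # lengths of the two input arrays

# Sum each column, then see how far each sum exceeds the data available.
used = tuple(sum(column) for column in zip(*groups))
overrun = tuple(max(0, u - a) for u, a in zip(used, available))

print(used)     # (351, 144)
print(overrun)  # (0, 1)
```

The second column asks for one more row than exists, which is the problem described here.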

Answer 1

Jam*_*ess 4

Here is an algorithm that I think might work for you.

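You divide test_length and train_length by their GCD to get the ratio as a simple fraction. You then add the numerator and denominator together, and that sum is the size factor for your groups.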

For example, if the ratio is 3:2, the size of each group must be a multiple of 5.
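As a small illustration of that step, assuming hypothetical lengths of 60 and 40 (chosen only because they reduce to 3:2):

```python
from math import gcd  # Python 3.5+; on Python 2 use `from fractions import gcd`

test_length, train_length = 60, 40   # hypothetical lengths with a 3:2 ratio

divisor = gcd(test_length, train_length)
test_multiple = test_length // divisor     # numerator of the reduced ratio
train_multiple = train_length // divisor   # denominator of the reduced ratio
group_multiple = test_multiple + train_multiple

print(test_multiple, train_multiple, group_multiple)  # 3 2 5
```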

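You then take the total length and divide it by the number of folds to get the ideal size of the first group, which will most likely be a floating-point number. You find the largest multiple of 5 that is less than or equal to that value, and that is your first group.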

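Subtract that value from the total, then divide by folds-1 to get the ideal size of the next group. Again find the largest multiple of 5, subtract it from the total, and carry on until all of the groups have been calculated.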
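The sizing step on its own might look like the sketch below. The function name `greedy_group_sizes` and its parameters are made up for illustration, and it uses integer arithmetic in place of the floating-point ideal size (flooring `remaining / folds_left` to a multiple gives the same result).

```python
def greedy_group_sizes(total_length, multiple, folds):
    """Pick one size per fold, each a multiple of `multiple`."""
    sizes = []
    remaining = total_length
    for folds_left in range(folds, 0, -1):
        # Largest multiple not exceeding the ideal size remaining / folds_left.
        size = (remaining // (folds_left * multiple)) * multiple
        sizes.append(size)
        remaining -= size
    return sizes, remaining  # `remaining` is data that fits in no group


print(greedy_group_sizes(50, 5, 3))  # ([15, 15, 20], 0)
print(greedy_group_sizes(23, 5, 3))  # ([5, 5, 10], 3)
```

Because each later group is sized from whatever is left, the rounding slack from earlier groups tends to end up in the last one, which is why the sizes come out uneven.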

Some example code:

from math import gcd  # Python 3.5+; on Python 2 use: from fractions import gcd

# test_length, train_length and folds are assumed to be defined already.
total_length = test_length + train_length
divisor = gcd(test_length, train_length)
test_multiple = test_length // divisor
train_multiple = train_length // divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length // total_multiple < folds:
  total_multiple = total_length // folds
  test_multiple = int(round(float(test_length) * total_multiple / total_length))
  train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds, 0, -1):
  float_size = float(total_length) / i    # ideal size of this group
  int_size = int(float_size / total_multiple) * total_multiple  # round down to a multiple
  test_size = int_size * test_multiple // total_multiple
  train_size = int_size * train_multiple // total_multiple
  test_length -= test_size    # keep track of the test data used
  train_length -= train_size  # keep track of the train data used
  total_length -= int_size
  groups.append((test_size, train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups, test_length, 0)
distribute_overrun(groups, train_length, 1)

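This has been updated to keep track of how much data each group uses (test and train), without worrying about whether we use too much at first.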

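Then at the end, if there is any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back across the groups by decrementing the appropriate side of the ratio in as many groups as it takes to bring the overrun back to zero.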

The distribute_overrun function is included below.

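def distribute_overrun(groups, overrun, part):
    # overrun is the leftover length: negative when too much data was assigned.
    # Take one item back from each group in turn until it reaches zero.
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i = (i + 1) % len(groups)  # wrap around if overrun exceeds the group count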

In the end, groups will be a list of tuples containing the test_size and train_size for each group.
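Put together, the whole calculation could be wrapped in a single function taking the number of folds as a third argument. This is a sketch under stated assumptions, not a definitive version: the name `fold_sizes`, the `ValueError` guard, and the inlined give-back loop are additions for illustration, and it targets Python 3.

```python
from math import gcd  # Python 3.5+


def fold_sizes(test_length, train_length, folds):
    """Return a list of (test_size, train_size) tuples, one per fold."""
    total_length = test_length + train_length
    if folds < 1 or folds > total_length:
        raise ValueError("folds must be between 1 and the total length")

    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Not enough data for `folds` groups at the exact ratio: approximate it.
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(float(test_length) * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for remaining_folds in range(folds, 0, -1):
        ideal = float(total_length) / remaining_folds
        size = int(ideal / total_multiple) * total_multiple
        test_size = size * test_multiple // total_multiple
        train_size = size * train_multiple // total_multiple
        test_length -= test_size
        train_length -= train_size
        total_length -= size
        groups.append([test_size, train_size])

    # Give back any overrun, one item at a time, cycling through the folds.
    for part, leftover in ((0, test_length), (1, train_length)):
        i = 0
        while leftover < 0:
            groups[i % folds][part] -= 1
            leftover += 1
            i += 1
    return [tuple(g) for g in groups]


print(fold_sizes(30, 20, 3))   # [(9, 6), (9, 6), (12, 8)]
print(fold_sizes(357, 143, 9))
```

With the lengths from the question (357, 143, 9 folds), the first group gives one item back, so the second column sums to 143 instead of 144. Any data that does not fit a whole multiple is simply left unassigned.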

If this sounds like what you're after but you need me to expand on the code example, let me know.
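One possible extension, sketched here as an assumption about how the sizes would be used, is to slice the two data sets into folds. The helper `split_into_folds` and the sample data are hypothetical; slicing works the same way on lists and NumPy arrays.

```python
def split_into_folds(test_data, train_data, sizes):
    """Slice both sequences into folds of the given (test_size, train_size)."""
    folds = []
    test_pos = train_pos = 0
    for test_size, train_size in sizes:
        fold_test = test_data[test_pos:test_pos + test_size]
        fold_train = train_data[train_pos:train_pos + train_size]
        folds.append((fold_test, fold_train))
        test_pos += test_size
        train_pos += train_size
    return folds


# Hypothetical data: 30 test rows and 20 train rows, with uneven fold sizes.
test_data = list(range(30))
train_data = list(range(100, 120))
sizes = [(9, 6), (9, 6), (12, 8)]

for fold_test, fold_train in split_into_folds(test_data, train_data, sizes):
    print(len(fold_test), len(fold_train))
```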
