Rya*_*axe 11 python numpy function
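I have some data that I want to split into smaller groups that all keep a common ratio. I wrote a function that takes two arrays, calculates the size ratio, and then tells me the options for how many groups I can split the data into (if all groups are the same size). Here is the function: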
def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else:
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " + str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns= len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were givens"
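So my question is: how can I add a third input to the function, the number of folds, and change the function so that, instead of iterating to make sure every group has the same size and the right ratio, each group simply has the right ratio while the sizes are allowed to vary?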
Addition for @JamesHolderness
So your method is almost perfect, but there is one issue:
With lengths 357 and 143 and 9 folds, this is the returned list:
[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]
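Now, when you add up the columns, you get this: 351 144
The 351 is fine because it is less than 357, but the 144 does not work because it is larger than 143! The reason is that 357 and 143 are the lengths of the arrays, so the 144th row of that array does not exist...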
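The check described above can be reproduced in a few lines. This is only a sketch: the names `sizes`, `first_length` and `second_length` are made up for illustration, and the list is the one shown above.

```python
# Group sizes returned for lengths 357 and 143 with 9 folds (copied from above).
sizes = [(39, 16)] * 9

# Lengths of the two input arrays, in the same order as the tuple columns.
first_length, second_length = 357, 143

# Sum each column of the list of tuples.
first_total = sum(a for a, _ in sizes)
second_total = sum(b for _, b in sizes)

print(first_total, second_total)      # 351 144

# A column total larger than the array length asks for rows that do not exist.
print(first_total <= first_length)    # True
print(second_total <= second_length)  # False
```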
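I think this is an algorithm that might work for you.
You take test_length and train_length and divide them by their GCD to get the ratio as a simple fraction. You take the numerator and the denominator and add them together, and that is the size factor for your groups.
For example, if the ratio is 3:2, the size of each group must be a multiple of 5.
You then take total_length and divide it by the number of folds to get the ideal size for the first group, which may well be a floating point number. You find the largest multiple of 5 that is less than or equal to that value, and that is your first group.
Subtract that value from the total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract it from the total, and continue until you have calculated all the groups.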
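To make these steps concrete, here is a minimal Python 3 sketch of just the size calculation. The function name `fold_sizes` and the 60/40 input are assumptions chosen for illustration, and the sketch leaves out the case where there is too little data for the requested number of folds.

```python
from math import gcd


def fold_sizes(test_length, train_length, folds):
    """Return the total size of each group, each a multiple of the size factor."""
    divisor = gcd(test_length, train_length)
    # Reduced ratio: 60:40 becomes 3:2, so the size factor is 3 + 2 = 5.
    factor = test_length // divisor + train_length // divisor
    remaining = test_length + train_length
    sizes = []
    for folds_left in range(folds, 0, -1):
        # Largest multiple of factor that does not exceed remaining / folds_left,
        # computed with integer arithmetic to avoid floating point rounding.
        size = (remaining // (folds_left * factor)) * factor
        sizes.append(size)
        remaining -= size
    return sizes


print(fold_sizes(60, 40, 3))  # [30, 35, 35]
```

With 100 rows and 3 folds, the ideal first size is 33.3, which rounds down to 30. The remaining 70 rows split into two groups of 35.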
Some example code:
from fractions import gcd  # Python 2; on Python 3 use math.gcd instead

# Note: "/" between integers is integer division here (Python 2).
total_length = test_length + train_length
divisor = gcd(test_length,train_length)
test_multiple = test_length/divisor
train_multiple = train_length/divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length/total_multiple < folds:
    total_multiple = total_length/folds
    test_multiple = int(round(float(test_length)*total_multiple/total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds,0,-1):
    float_size = float(total_length)/i
    int_size = int(float_size/total_multiple)*total_multiple
    test_size = int_size*test_multiple/total_multiple
    train_size = int_size*train_multiple/total_multiple
    test_length -= test_size # keep track of the test data used
    train_length -= train_size # keep track of the train data used
    total_length -= int_size
    groups.append((test_size,train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups,test_length,0)
distribute_overrun(groups,train_length,1)
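This has been updated to keep track of the sizes used by each group (test and train), without worrying if we use too much at first.
Then at the end, if there is any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back into the groups by decrementing the appropriate side of the ratio in as many items as it takes to bring the overrun back to zero.
The distribute_overrun function is included below.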
def distribute_overrun(groups,overrun,part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1
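At the end of that, groups will be a list of tuples containing the test_size and train_size for each group.
If that sounds like the sort of thing you want, but you need me to expand on the code example, just let me know.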
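As one possible expansion, here is a hedged Python 3 sketch that wraps the snippets above in a function and turns the resulting sizes into row ranges. The names `group_sizes` and `group_slices` are hypothetical, and it assumes folds is no larger than the total length. It also cycles through the groups when giving back an overrun, which goes beyond the single pass in the original. Note that Python 3's `round` resolves exact .5 ties differently from Python 2.

```python
from math import gcd


def group_sizes(test_length, train_length, folds):
    """Return a (test_size, train_size) tuple for each fold."""
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Adjust the ratio if there isn't enough data for the requested folds.
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(test_length * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for i in range(folds, 0, -1):
        # Largest multiple of total_multiple not exceeding total_length / i.
        int_size = (total_length // (i * total_multiple)) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        test_length -= test_size    # keep track of the test data used
        train_length -= train_size  # keep track of the train data used
        total_length -= int_size
        groups.append([test_size, train_size])

    # Give back any overrun one row at a time. Cycling with "% folds" is an
    # assumption added here so a large overrun cannot run off the list.
    for part, overrun in ((0, test_length), (1, train_length)):
        i = 0
        while overrun < 0:
            groups[i % folds][part] -= 1
            overrun += 1
            i += 1
    return [tuple(group) for group in groups]


def group_slices(test_length, train_length, folds):
    """Turn the sizes into (test_slice, train_slice) pairs of consecutive rows."""
    slices = []
    a = b = 0
    for test_size, train_size in group_sizes(test_length, train_length, folds):
        slices.append((slice(a, a + test_size), slice(b, b + train_size)))
        a += test_size
        b += train_size
    return slices


sizes = group_sizes(357, 143, 9)
print(sizes[0], sizes[1])  # (39, 15) (39, 16)
print(sum(t for t, _ in sizes), sum(n for _, n in sizes))  # 351 143
```

Each pair of slices can index the two arrays directly. With NumPy arrays, `np.vstack((test_data[ts], train_data[ns]))` would build one group from a pair `(ts, ns)`. Any rows left over after the last fold (6 of the 357 in this example) are simply not assigned.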