I am trying to understand the logical process behind this splitting method.
How good a split can we expect from it? Is this a recommended way to split a dataset?
# Excerpt from TensorFlow's retrain.py; file_name, base_name, dir_name,
# label_name, the percentages, and the per-set lists are defined in the
# surrounding loop over image files.
import hashlib
import re

from tensorflow.python.util import compat

# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in: the data set creator has a way of
# grouping photos that are close variations of each other. For example,
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
                    (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                   (100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
    validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
else:
    training_images.append(base_name)
result[label_name] = {
    'dir': dir_name,
    'training': training_images,
    'testing': testing_images,
    'validation': validation_images,
}
This code just distributes the file names "randomly" (but reproducibly) over a number of bins and then divides the bins into three categories. The number of bits in the hash is irrelevant (as long as it is "enough", which for a job like this is probably around 35).
Reducing modulo n + 1 produces a value on [0, n], and multiplying that by 100/n obviously produces a value on [0, 100], which is interpreted as a percentage. The point of taking n to be MAX_NUM_IMAGES_PER_CLASS is to keep the rounding error in that interpretation below "one image".
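As a minimal sketch of that arithmetic (assuming the constant 2**27 - 1 that retrain.py uses for MAX_NUM_IMAGES_PER_CLASS, and substituting plain .encode('utf-8') for compat.as_bytes):

import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # value used in retrain.py, ~134M bins

def assignment_percentage(file_name):
    # Strip the grouping suffix so close variants of the same photo
    # always hash to the same value and land in the same set.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    h = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    bucket = h % (MAX_NUM_IMAGES_PER_CLASS + 1)          # a value on [0, n]
    return bucket * (100.0 / MAX_NUM_IMAGES_PER_CLASS)   # a value on [0, 100]

# The same name always maps to the same percentage, so re-running the
# split after adding new files never moves an existing file between sets,
# and grouped variants stay together:
print(assignment_percentage('leaf_nohash_1.jpg'))
print(assignment_percentage('leaf_nohash_2.jpg'))  # same group, same value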
This strategy is reasonable, but it looks a bit more complicated than it really is (since rounding is still going on, and the remainder introduces a bias, although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by precomputing, once, the extent of each set within the whole space of 2^160 hash values and then checking a hash against the two boundaries. That notionally still involves rounding, but with 160 bits it is only as much as is inherent in representing a fraction like 31% in floating point anyway.
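A sketch of that alternative, assuming integer percentages so the two boundaries are exact integers (make_bounds and which_set are hypothetical names, not part of retrain.py):

import hashlib
import re

HASH_SPACE = 2 ** 160  # size of the SHA-1 output space

def make_bounds(validation_percentage, testing_percentage):
    # Computed once; every per-file decision is then just two exact
    # integer comparisons, with no per-file rounding at all.
    validation_bound = HASH_SPACE * validation_percentage // 100
    testing_bound = HASH_SPACE * (validation_percentage + testing_percentage) // 100
    return validation_bound, testing_bound

def which_set(file_name, validation_bound, testing_bound):
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    h = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    if h < validation_bound:
        return 'validation'
    elif h < testing_bound:
        return 'testing'
    return 'training'

val_bound, test_bound = make_bounds(validation_percentage=10, testing_percentage=10)
print(which_set('leaf_nohash_1.jpg', val_bound, test_bound))

For non-integer percentages, fractions.Fraction from the standard library would keep the boundaries exact as well.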