I am trying to understand the logical process behind this splitting method.
How good a split can we expect from it? Is this a recommended way to split a dataset?
# Excerpt from TensorFlow's retrain.py; file_name, base_name, dir_name,
# label_name, the percentages, and the per-set lists are defined in the
# surrounding loop over image files.
import hashlib
import re

from tensorflow.python.util import compat

# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in: the data set creator has a way of
# grouping photos that are close variations of each other. For example,
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
                    (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                   (100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
    validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
else:
    training_images.append(base_name)
result[label_name] = {
    'dir': dir_name,
    'training': training_images,
    'testing': testing_images,
    'validation': validation_images,
}
This code just distributes the file names "randomly" (but reproducibly) over a number of bins and then divides the bins into three categories. The number of bits in the hash is irrelevant (as long as it is "enough", which for a job like this is probably around 35).
Reducing modulo n + 1 produces a value on [0, n], and multiplying that by 100/n obviously produces a value on [0, 100], which is interpreted as a percentage. The point of taking n to be MAX_NUM_IMAGES_PER_CLASS is to keep the rounding error in that interpretation below "one image".
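As a minimal sketch of that arithmetic (assuming the constant 2**27 - 1 that retrain.py uses for MAX_NUM_IMAGES_PER_CLASS, and substituting plain .encode('utf-8') for compat.as_bytes):

import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # value used in retrain.py, ~134M bins

def assignment_percentage(file_name):
    # Strip the grouping suffix so close variants of the same photo
    # always hash to the same value and land in the same set.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    h = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    bucket = h % (MAX_NUM_IMAGES_PER_CLASS + 1)          # a value on [0, n]
    return bucket * (100.0 / MAX_NUM_IMAGES_PER_CLASS)   # a value on [0, 100]

# The same name always maps to the same percentage, so re-running the
# split after adding new files never moves an existing file between sets,
# and grouped variants stay together:
print(assignment_percentage('leaf_nohash_1.jpg'))
print(assignment_percentage('leaf_nohash_2.jpg'))  # same group, same value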
This strategy is reasonable, but it looks a bit more complicated than it really is (since rounding is still going on, and the remainder introduces a bias, although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by precomputing, once, the extent of each set within the whole space of 2^160 hash values and then checking a hash against the two boundaries. That notionally still involves rounding, but with 160 bits it is only as much as is inherent in representing a fraction like 31% in floating point anyway.
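A sketch of that alternative, assuming integer percentages so the two boundaries are exact integers (make_bounds and which_set are hypothetical names, not part of retrain.py):

import hashlib
import re

HASH_SPACE = 2 ** 160  # size of the SHA-1 output space

def make_bounds(validation_percentage, testing_percentage):
    # Computed once; every per-file decision is then just two exact
    # integer comparisons, with no per-file rounding at all.
    validation_bound = HASH_SPACE * validation_percentage // 100
    testing_bound = HASH_SPACE * (validation_percentage + testing_percentage) // 100
    return validation_bound, testing_bound

def which_set(file_name, validation_bound, testing_bound):
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    h = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    if h < validation_bound:
        return 'validation'
    elif h < testing_bound:
        return 'testing'
    return 'training'

val_bound, test_bound = make_bounds(validation_percentage=10, testing_percentage=10)
print(which_set('leaf_nohash_1.jpg', val_bound, test_bound))

For non-integer percentages, fractions.Fraction from the standard library would keep the boundaries exact as well.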