Sampling without replacement

Alb*_*ert 7 tensorflow

How do I do sampling without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False), for some very large integer n (e.g. 100k-100M) and a smaller k (e.g. 100-10k). Also, I want it to be efficient and on the GPU, so solutions like tf.py_func are not really an option for me. Anything that would use tf.range(n) is not an option either, because n could be very large.

jde*_*esa 6

Here is a way to do it:

n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]
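
For instance, a minimal runnable sketch (assuming the TF 1.x API; the values of n and sample_size here are only illustrative):

import tensorflow as tf

n = 1000          # illustrative value
sample_size = 10  # illustrative value
idx = tf.random_shuffle(tf.range(n))[:sample_size]
with tf.Session() as sess:
    print(sess.run(idx))  # 10 distinct values drawn from [0, n)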

EDIT:


I had posted the answer below but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor with size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:

n = ...
sample_size = ...
# Start from a random draw, which may contain duplicates...
idx = tf.random_uniform([sample_size], maxval=n, dtype=tf.int64)
# ...and keep redrawing the whole batch while duplicates remain
idx = tf.while_loop(
    lambda idx: tf.size(tf.unique(idx)[0]) < tf.size(idx),
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])

EDIT 2:


Regarding the average number of iterations of the previous method: if we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:


p = product((n - (i - 1)) / n for i in 1 .. k)
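
For example, for n = 10 and k = 5, p = (10 · 9 · 8 · 7 · 6) / 10^5 = 0.3024, so on average 1 / 0.3024 ≈ 3.3 iterations are needed, matching the first row of the table below.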


Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (the mean of a geometric distribution). Here is a function that calculates the average number of trials in Python for some k and n values:

def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it

And here are some results:

+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 | 3.3      |
|   100 |   10 | 1.6      |
|  1000 |   10 | 1.1      |
|  1000 |  100 | 167.8    |
| 10000 |   10 | 1.0      |
| 10000 |  100 | 1.6      |
| 10000 | 1000 | 2.9e+22  |
+-------+------+----------+

You can see it varies wildly depending on the parameters.


It is possible, though, to construct the vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it looks like this:

import random

def sample_wo_replacement(n, k):
    sample = [0] * k
    for i in range(k):
        # Draw from a range that shrinks by one on every step
        sample[i] = random.randint(0, n - 1 - i)
    # Shift each draw up past the earlier draws so all k values end up distinct
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample

random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]
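
As a quick sanity check of the function above (a small hypothetical test, continuing from the same script), every returned sample should contain k distinct values in [0, n):

random.seed(0)
for _ in range(1000):
    s = sample_wo_replacement(50, 10)
    # k distinct values, all inside [0, n)
    assert len(set(s)) == 10
    assert all(0 <= v < 50 for v in s)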

Here is a possible way to do that in TensorFlow (not sure if the best one):

import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample

For example:

with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]

This is quite intensive on tf.while_loops, though, which are well known not to be particularly fast in TensorFlow, so I wouldn't know how fast you can really get with this method without some kind of benchmarking.


EDIT 4:


One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c and pick a random number of values from each chunk, then shuffle everything. The amount of memory you use is bounded by c, and you don't need nested loops. If n is divisible by c, you should get a perfectly random distribution; otherwise values in the last "short" chunk would receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long because it accounts for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.

import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample

np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8  9 12 13  3]
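
As a quick sanity check (a small hypothetical test, continuing from the same script with illustrative parameters), the result should contain exactly k distinct values inside [0, n):

np.random.seed(0)
s = sample_chunked(1000, 100, 128)
assert len(s) == 100
assert len(np.unique(s)) == 100
assert s.min() >= 0 and s.max() < 1000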

A quick benchmark on my machine: sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds, while I was not able to get the previous algorithm (the sample_wo_replacement function above) to complete with the same parameters. It should be possible to implement the chunked method in TensorFlow, maybe using tf.TensorArray, although it would take significant effort to get it exactly right.
