Here is one way to do it:

    n = ...
    sample_size = ...
    idx = tf.random_shuffle(tf.range(n))[:sample_size]

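For readers without TensorFlow at hand, the same shuffle-and-slice idea can be sketched in plain Python. This is only an illustration: `sample_by_shuffle` is a hypothetical helper name, not part of TensorFlow or of the original answer, and like the snippet above it materializes all n values.

```python
import random

def sample_by_shuffle(n, sample_size, rng=random):
    """Return sample_size distinct integers from range(n), in random order."""
    if not 0 <= sample_size <= n:
        raise ValueError("sample_size must be between 0 and n")
    values = list(range(n))   # O(n) memory, like tf.range(n)
    rng.shuffle(values)       # random permutation of 0 .. n - 1
    return values[:sample_size]

idx = sample_by_shuffle(10, 4)
print(idx)  # four distinct values from 0..9; the order varies per run
```

The slice is unique by construction, since a permutation never repeats a value.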
EDIT:

I had posted the answer above but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor with size O(n) (`numpy.random.choice` with `replace=False` is also implemented as a slice of a permutation). You could resort to a `tf.while_loop` that keeps drawing until you have unique indices:

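    n = ...
    sample_size = ...
    # Start from all zeros so the first uniqueness check fails (assumes sample_size > 1)
    idx = tf.zeros([sample_size], dtype=tf.int64)
    idx = tf.while_loop(
        # Keep iterating while idx contains repeated values
        lambda idx: tf.size(tf.unique(idx)[0]) < tf.size(idx),
        # Draw a fresh candidate vector of indices in [0, n)
        lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
        [idx])
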
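The retry idea can be tried out without a TensorFlow session using plain Python. This sketch is an analogue of the loop rather than the graph code itself: `sample_by_retry` is a hypothetical name, and it also returns the number of attempts so that the cost of retrying is visible.

```python
import random

def sample_by_retry(n, sample_size, rng=random):
    """Redraw sample_size indices from range(n) until all are distinct."""
    if not 0 <= sample_size <= n:
        raise ValueError("sample_size must be between 0 and n")
    attempts = 0
    while True:
        attempts += 1
        # Independent draws with replacement; duplicates are possible
        idx = [rng.randrange(n) for _ in range(sample_size)]
        if len(set(idx)) == sample_size:
            return idx, attempts

idx, attempts = sample_by_retry(1000, 10)
print(idx, attempts)
```

Only `sample_size` values are held at any time, at the price of an unbounded number of redraws.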
EDIT 2:

About the average number of iterations in the previous method: if we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:

    p = product((n - (i - 1)) / n for i in 1 .. k)

Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (the mean of a geometric distribution). Here is a function that calculates the average number of trials in Python for some k and n values:

    def avg_iter(k, n):
        if k > n or n <= 0 or k < 0:
            raise ValueError()
        avg_it = 1.0
        for p in (float(n) / (n - i) for i in range(k)):
            avg_it *= p
        return avg_it

And here are some results:

    +-------+------+----------+
    | n     | k    | Avg iter |
    +-------+------+----------+
    | 10    | 5    | 3.3      |
    | 100   | 10   | 1.6      |
    | 1000  | 10   | 1.1      |
    | 1000  | 100  | 167.8    |
    | 10000 | 10   | 1.0      |
    | 10000 | 100  | 1.6      |
    | 10000 | 1000 | 2.9e+22  |
    +-------+------+----------+

You can see it varies wildly depending on the parameters.
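
As a rough sanity check of the 1 / p estimate, the retry loop can be simulated directly and compared with the exact value. This is a sketch under assumptions: `expected_trials` and `simulated_trials` are hypothetical names, the code assumes Python 3 division, and the fixed seed is there only to make the run repeatable.

```python
import random

def expected_trials(k, n):
    """Exact 1 / p, with p = product((n - i) / n for i in range(k))."""
    if not 0 <= k <= n or n <= 0:
        raise ValueError("need n > 0 and 0 <= k <= n")
    result = 1.0
    for i in range(k):
        result *= n / (n - i)
    return result

def simulated_trials(k, n, runs, rng):
    """Average number of draws of k values needed before none repeats."""
    total = 0
    for _ in range(runs):
        while True:
            total += 1
            if len({rng.randrange(n) for _ in range(k)}) == k:
                break
    return total / runs

rng = random.Random(0)
print(round(expected_trials(5, 10), 2))     # 3.31
print(simulated_trials(5, 10, 20000, rng))  # should land close to the exact value
```

For n = 10 and k = 5 both numbers should agree with the 3.3 in the table, up to sampling noise in the simulation.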
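
It is possible, though, to construct the vector in a fixed number of steps, although the only algorithm I could come up with is O(k²). In pure Python, it goes like this: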

    import random

    def sample_wo_replacement(n, k):
        sample = [0] * k
        # Draw i-th value from a range that shrinks by one at every step
        for i in range(k):
            sample[i] = random.randint(0, n - 1 - i)
        # Shift each value past the earlier draws so that no value repeats
        for i, v in reversed(list(enumerate(sample))):
            for p in reversed(sample[:i]):
                if v >= p:
                    v += 1
            sample[i] = v
        return sample

    random.seed(100)
    print(sample_wo_replacement(10, 5))
    # e.g. [2, 8, 9, 7, 1]
    print(sample_wo_replacement(10, 10))
    # e.g. [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]

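To see why the shifting step yields a valid sample, one can enumerate every possible combination of raw draws for a small case and check that each one decodes to a different ordered sample. The sketch below assumes the i-th raw draw lies in `range(n - i)`; `decode` is a hypothetical name for the second loop of the function above.

```python
import itertools
import math

def decode(raw):
    """Turn raw draws (raw[i] in range(n - i)) into distinct values."""
    sample = list(raw)
    for i in reversed(range(len(raw))):
        v = raw[i]
        for p in reversed(raw[:i]):  # earlier raw draws, nearest first
            if v >= p:
                v += 1
        sample[i] = v
    return sample

n, k = 5, 3
raw_draws = itertools.product(*(range(n - i) for i in range(k)))
decoded = [tuple(decode(raw)) for raw in raw_draws]

# Every decoded sample has k distinct values in range(n) ...
assert all(len(set(s)) == k and 0 <= min(s) and max(s) < n for s in decoded)
# ... and each ordered sample appears exactly once
assert len(set(decoded)) == len(decoded) == math.factorial(n) // math.factorial(n - k)
print(len(decoded))  # 60
```

Since the raw draws are uniform and the mapping is one-to-one, every ordered sample is equally likely.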
Here is one possible way to implement this in TensorFlow (not sure if it is the best one):

    import tensorflow as tf

    def sample_wo_replacement_tf(n, k):
        # First loop
        sample = tf.constant([], dtype=tf.int64)
        i = 0
        sample, _ = tf.while_loop(
            lambda sample, i: i < k,
            # This is ugly but I did not want to define more functions
            lambda sample, i: (tf.concat([sample,
                                          tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                         axis=0),
                               i + 1),
            [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
        # Second loop
        def inner_loop(sample, i):
            sample_size = tf.shape(sample)[0]
            v = sample[i]
            j = i - 1
            v, _ = tf.while_loop(
                lambda v, j: j >= 0,
                lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
                [v, j])
            return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
        i = tf.shape(sample)[0] - 1
        sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
        return sample

An example:

    with tf.Graph().as_default(), tf.Session() as sess:
        tf.set_random_seed(100)
        sample = sample_wo_replacement_tf(10, 5)
        for i in range(10):
            print(sess.run(sample))
    # [3 0 6 8 4]
    # [5 4 8 9 3]
    # [1 4 0 6 8]
    # [8 9 5 6 7]
    # [7 5 0 2 4]
    # [8 4 5 3 7]
    # [0 5 7 4 3]
    # [2 0 3 8 6]
    # [3 4 8 5 1]
    # [5 7 0 2 9]

This is quite intensive on `tf.while_loop`s, though, which are well known not to be particularly fast in TensorFlow, so I wouldn't know how fast you can really get with this method without some kind of benchmarking.

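EDIT 4:

One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c, pick a random amount of numbers from each chunk, and then shuffle everything. The amount of memory you use is bounded by c, and you don't need nested loops. If n is divisible by c, you should get a perfect random distribution; otherwise, values in the last "short" chunk receive some extra probability (which may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long because it accounts for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.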

    import numpy as np

    def sample_chunked(n, k, chunk=None):
        chunk = chunk or n
        last_chunk = chunk
        parts = n // chunk
        # Distribute k among chunks
        max_p = min(float(chunk) / k, 1.0)
        max_p_last = max_p
        if n % chunk != 0:
            parts += 1
            last_chunk = n % chunk
            max_p_last = min(float(last_chunk) / k, 1.0)
        p = np.full(parts, 2)
        # Iterate until a valid distribution is found
        while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
            p = np.random.uniform(size=parts)
            p /= np.sum(p)
        dist = (k * p).astype(np.int64)
        sample_size = np.sum(dist)
        # Account for rounding errors
        while sample_size < k:
            i = np.random.randint(len(dist))
            while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
                i = np.random.randint(len(dist))
            dist[i] += 1
            sample_size += 1
        while sample_size > k:
            i = np.random.randint(len(dist))
            while dist[i] == 0:
                i = np.random.randint(len(dist))
            dist[i] -= 1
            sample_size -= 1
        assert sample_size == k
        # Generate sample parts
        sample_parts = []
        for i, v in enumerate(np.nditer(dist)):
            if v <= 0:
                continue
            c = chunk if i < parts - 1 else last_chunk
            base = chunk * i
            sample_parts.append(base + np.random.choice(c, v, replace=False))
        sample = np.concatenate(sample_parts, axis=0)
        np.random.shuffle(sample)
        return sample

    np.random.seed(100)
    print(sample_chunked(15, 5, 4))
    # [ 8  9 12 13  3]

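A quick benchmark of `sample_chunked(100000000, 100000, 100000)` on my computer takes about 3.1 seconds, while I was not able to run the previous algorithm (the `sample_wo_replacement` function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using `tf.TensorArray`, although it would take significant effort to get it exactly right.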
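
The chunked idea can also be sketched in plain Python with a different way of splitting k among the chunks. Note that this swaps in another technique and is not the NumPy function above: instead of drawing random proportions, it allocates the k picks one at a time, choosing each chunk with probability proportional to its remaining free slots, which also handles a short last chunk. `sample_chunked_simple` is a hypothetical name, and the allocation loop costs O(k · n / chunk), so this is meant as an illustration rather than a fast implementation.

```python
import random

def sample_chunked_simple(n, k, chunk):
    """Draw k distinct integers from range(n), working chunk by chunk."""
    if not 0 <= k <= n or chunk <= 0:
        raise ValueError("need 0 <= k <= n and chunk > 0")
    # Chunk sizes; the last one is shorter when chunk does not divide n
    sizes = [min(chunk, n - start) for start in range(0, n, chunk)]
    # Assumption: allocate picks one by one, weighting chunks by free slots
    remaining = list(sizes)
    counts = [0] * len(sizes)
    free = n
    for _ in range(k):
        r = random.randrange(free)
        for c, rem in enumerate(remaining):
            if r < rem:
                break
            r -= rem
        counts[c] += 1
        remaining[c] -= 1
        free -= 1
    # Sample inside each chunk independently, then mix the chunks together
    sample = []
    for c, count in enumerate(counts):
        base = c * chunk
        sample.extend(random.sample(range(base, base + sizes[c]), count))
    random.shuffle(sample)
    return sample

print(sample_chunked_simple(15, 5, 4))
```

`random.sample` on a `range` object does not materialize the whole chunk, so apart from the per-chunk counters the memory used grows with k rather than n.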