Python：从一堆点中选择分布更好的n个点

Question

Python：从一堆点中选择分布更好的n个点

Jos*_*sch 3 python numpy scipy

我在 XY 平面中有一个 numpy 点数组，例如：

我想从所有这些点中选择更好地分布的 n 个点（比如 100 个）。也就是说，我希望点的密度在任何地方都是恒定的。

像这样的东西：

在此处输入图片说明

是否有任何 pythonic 方式或任何 numpy/scipy 函数来做到这一点？

Answer 1

Joe*_*ton 5

@EMS 非常正确，您应该仔细考虑您想要什么。

有更复杂的方法来做到这一点（EMS 的建议非常好！），但一种蛮力的方法是将这些点分箱到一个规则的矩形网格上，并从每个分箱中随机抽取一个点。

主要的缺点是你不会得到你要求的分数。相反，您会得到一些小于该数字的数字。

一些创造性的索引pandas使这种“网格化”方法非常容易，尽管您当然也可以使用“纯”numpy 来做到这一点。

作为最简单的、蛮力的网格方法的示例：（这里有很多我们可以做得更好的地方。）

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))

# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000

# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)

# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])

# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()

Run Code Online (Sandbox Code Playgroud)

在此处输入图片说明

松散地基于@EMS 在评论中的建议，这是另一种方法。

我们将使用核密度估计来计算点的密度，然后使用它的倒数作为选择给定点的概率。

scipy.stats.gaussian_kde未针对此用例（或通常针对大量点）进行优化。瓶颈就在这里。可以通过多种方式（近似值、成对距离的特殊情况等）为这个特定用例编写更优化的版本。但是，这超出了这个特定问题的范围。请注意，对于这个具有 1e5 点的特定示例，运行需要一两分钟。

这种方法的优点是您可以获得所需的确切点数。缺点是您可能拥有选定点的本地集群。

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()

Run Code Online (Sandbox Code Playgroud)

在此处输入图片说明

归档时间：	11 年，11 月前
查看次数：	2020 次
最近记录：	7 年，8 月前