从 numpy 距离数组中提取 N 个最近的对

Question

从 numpy 距离数组中提取 N 个最近的对

rob*_*anf 3 python arrays numpy distance

我有一个大型的对称二维距离数组。我想获得最接近的 N 对观察结果。

该数组存储为一个 numpy 压缩数组，并且具有 1 亿个观测值的数量级。

这是一个在较小阵列上获得 100 个最近距离的示例（~500k 观察），但它比我想要的要慢得多。

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = np.ceil(np.sqrt(2* len(dists)))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

Run Code Online (Sandbox Code Playgroud)

在我看来，必须有更快的方法来使用标准的 numpy 或 scipy 函数来做到这一点，但我很难过。

注意如果很多对是等距的，那没关系，在这种情况下我不关心它们的顺序。

Answer 1

YXD*_*YXD 5

您不需要ti在每次调用condensed_to_square_index. 这是一个只计算一次的基本修改：

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = np.ceil(np.sqrt(2* len(dists)))
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

Run Code Online (Sandbox Code Playgroud)

您还可以矢量化以下内容的创建r：

r  = zip(ti[0][closest] + 1, ti[1][closest] + 1)

Run Code Online (Sandbox Code Playgroud)

或者

r = np.vstack(ti)[:, closest] + 1

Run Code Online (Sandbox Code Playgroud)

传奇。谢谢。我应该注意到计算 ti 节省的时间。我的笔记本电脑上的一些时间：我的慢速方法：每个循环 2.94s 你的第一个方法：每个循环 74.2 ms 使用 zip for r：每个循环 70.7 ms 使用 vstack for r：每个循环 70.5 ms (2认同)

归档时间：	11 年，9 月前
查看次数：	2415 次
最近记录：	11 年，9 月前