Big data ways to calculate sets of distances in R?

Question

dmc*_*mcd 6 r matrix coordinates bigdata dataframe

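Problem: We need a big data method for calculating distances between points. Below, we outline what we'd like to do using a five-observation data frame. However, this particular method is infeasible once the number of rows gets large (> 1 million). In the past we've used SAS for this kind of analysis, but we'd prefer R if possible. (Note: I'm not going to show code, because although I outline a way to do this on smaller datasets below, it is basically impossible to use with data at our scale.)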

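We start with a data frame of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).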

# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

stores
  id x y
1  1 1 1
2  2 0 2
3  3 1 0
4  4 2 2
5  5 0 0

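For each store, we want to know the number of stores within distance x. In a small data frame, this is straightforward: create another data frame of all coordinates, merge it back in, calculate the distances, create an indicator that is 1 when the distance is less than x, and add up the indicators (minus one for the store itself, which is at distance 0). This would result in a dataset that looks like this: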

   id x y  s1.dist  s2.dist  s3.dist  s4.dist  s5.dist
1:  1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2:  2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3:  3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4:  4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5:  5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000

When you (arbitrarily) count distances under 1.45 as "close," you end up with indicators that look like this:

# don't include the store itself in the total
   id x y s1.close s2.close s3.close s4.close s5.close total.close
1:  1 1 1        1        1        1        1        1           4
2:  2 0 2        1        1        0        0        0           1
3:  3 1 0        1        0        1        0        1           2
4:  4 2 2        1        0        0        1        0           1
5:  5 0 0        1        0        1        0        1           2

The final product should look like this:

   id total.close
1:  1           4
2:  2           1
3:  3           2
4:  4           1
5:  5           2
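
For concreteness, the small-data approach described above can be sketched roughly as follows. This is only an illustration of the cross-join idea on the five-store example; the names `others`, `pairs`, `cutoff`, and `total_close` are made up for this sketch, and it builds n^2 rows, so it is not meant for data at the scale mentioned.

```r
# Toy data from the example above
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

cutoff <- 1.45  # arbitrary threshold for "close" (hypothetical name)

# Second copy of the coordinates with distinct column names
others <- setNames(stores, c("id2", "x2", "y2"))

# Cross join: every store paired with every store (n^2 rows)
pairs <- merge(stores, others, by = NULL)

# Euclidean distance for each pair
pairs$dist <- sqrt((pairs$x2 - pairs$x)^2 + (pairs$y2 - pairs$y)^2)

# Count distances under the cutoff per store, minus one for the store itself
total_close <- aggregate(dist ~ id, data = pairs,
                         FUN = function(d) sum(d < cutoff) - 1)
names(total_close)[2] <- "total.close"

total_close
```

The cross join is what makes this infeasible for large inputs: one million stores would mean 10^12 rows in `pairs`.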

All advice appreciated.

Thank you very much

Answer 1

Dub*_*kay 1

Is there any reason you can't loop instead of doing it as one big calculation?

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
  sqrt((x1-x0)^2+(y1-y0)^2)
}

# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  # Calculate number less than arbitrary cutoff, subtract one for self
  num_within <- sum(distances<1.45)-1
  c(stores$id[i], num_within)
}))

Produces:

     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
[4,]    4    1
[5,]    5    2

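This will work with a data set of any size that you can bring into R, but it will get slower as the size increases. Here's a test on 10,000 entries, which runs in a couple of seconds on my machine: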

          [,1] [,2]
    [1,]     1  679
    [2,]     2  698
    [3,]     3  618
    [4,]     4  434
    [5,]     5  402
...
 [9995,]  9995  529
 [9996,]  9996  626
 [9997,]  9997  649
 [9998,]  9998  514
 [9999,]  9999  667
[10000,] 10000  603
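
A test of this kind could be set up along the following lines. This is a sketch under assumptions: the uniform random coordinates on a 10-by-10 square, the seed, and the names `big` and `count_close` are all hypothetical and are not the data or code behind the numbers above, so the counts will differ.

```r
set.seed(1)  # assumed seed, only so the run is repeatable
n <- 10000

# Assumed test data: uniform random points on a 10 x 10 square
big <- data.frame(id = seq_len(n),
                  x = runif(n, 0, 10),
                  y = runif(n, 0, 10))

# Same loop as in the answer, wrapped in a function (hypothetical name)
count_close <- function(df, cutoff = 1.45) {
  t(sapply(seq_len(nrow(df)), function(i) {
    d <- sqrt((df$x - df$x[i])^2 + (df$y - df$y[i])^2)
    # Number within the cutoff, minus one for the store itself
    c(df$id[i], sum(d < cutoff) - 1)
  }))
}

# Time the full run; timings depend on the machine
elapsed <- system.time(res <- count_close(big))
head(res)
elapsed
```

Each iteration is vectorized over all n points, so only the outer loop over stores is interpreted R code, and memory use stays at O(n) rather than O(n^2).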

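The more points there are, the slower it gets (because it has to compare every pair of points, which will always be O(n^2)), but without knowing the actual distance metric you want to compute, we can't optimize the slow part any further.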
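
As an illustration of what such an optimization might look like, here is a sketch that assumes the metric really is plain Euclidean distance with a fixed cutoff (an assumption, not something stated in the question). Points are binned into square cells whose side equals the cutoff, so each point only needs to be compared with points in its own cell and the eight surrounding cells. The function name `count_close_grid` is hypothetical.

```r
count_close_grid <- function(df, cutoff) {
  # Integer cell coordinates; the cell side equals the cutoff, so any point
  # closer than the cutoff must lie in the same or an adjacent cell
  cx <- as.integer(floor(df$x / cutoff))
  cy <- as.integer(floor(df$y / cutoff))
  cells <- split(seq_len(nrow(df)), paste(cx, cy))  # row indices per cell

  nb <- expand.grid(dx = -1:1, dy = -1:1)           # 3 x 3 block of offsets
  total <- integer(nrow(df))

  for (k in names(cells)) {
    idx <- cells[[k]]
    # Candidate rows: everything in the 3 x 3 block around this cell
    nb_keys <- paste(cx[idx[1]] + nb$dx, cy[idx[1]] + nb$dy)
    cand <- unlist(cells[intersect(nb_keys, names(cells))], use.names = FALSE)
    for (i in idx) {
      d <- sqrt((df$x[cand] - df$x[i])^2 + (df$y[cand] - df$y[i])^2)
      total[i] <- sum(d < cutoff) - 1L              # minus one for self
    }
  }
  data.frame(id = df$id, total.close = total)
}

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))
count_close_grid(stores, cutoff = 1.45)
```

This only helps when the points are spread over many cells; if most points fall into a few neighbouring cells, the work is still close to O(n^2).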
