R: Vectorize a loop to create a pairwise matrix

Question

jay*_*ayb 15 performance loops r matrix vectorization

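I want to speed up a function that creates a pairwise matrix describing how many times each object is selected before and after every other object, within a set of locations.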

Here is an example df:

  df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                   Fruit = c("apple", "orange", "pear",
                             "orange", "pear",
                             "pear", "apple",
                             "pear", "apple", "orange",
                             "pear", "apple", "orange"),
                   Order = c(1, 2, 3,
                            1, 2,
                            1, 2, 
                            1, 2, 3,
                            1, 1, 1))

In each Shop, Fruit is picked by a customer in a given Order.

The following function creates an m x n pairwise matrix:

loop.function <- function(df){
  
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim=c(nt,nt))
  
  for(m in 1:nt){
    
    for(n in 1:nt){
      
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  
  return(mat)
}

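Here mat[m,n] is the number of times fruits[m] is picked after fruits[n], and mat[n,m] is the number of times fruits[m] is picked before fruits[n]. Pairs of fruit picked at the same time (e.g. in Shop E) are not recorded.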

See the expected output:

> loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

You can see here that pear is picked before apple twice (in Shop C and D), and apple is picked before pear once (in Shop A).
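To make the counting rule concrete, here is a minimal sketch for a single shop, assuming the example df from above. It applies outer() to the Order values of Shop D only; the names shop_d, ord and after are introduced purely for illustration.

```r
# Example data from the question
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Pick order of each fruit within one shop (illustrative names)
shop_d <- df[df$Shop == "D", ]
ord <- setNames(shop_d$Order, as.character(shop_d$Fruit))

# after[m, n] is TRUE when fruit m was picked after fruit n in this shop
after <- outer(ord, ord, ">")
after
```

Summing such per-shop matrices over all shops (after aligning the fruit names) would give the full pairwise matrix. Ties, as in Shop E, are FALSE in both directions and so contribute nothing.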

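I am trying to improve my knowledge of vectorization, especially as a replacement for loops, so I would like to know how this loop can be vectorized.

(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)

Update

Here is a benchmark on the real data, with times = 10000, for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():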

Unit: milliseconds
                    expr        min        lq       mean    median         uq      max neval cld
      loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000   e
 tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456 10000   c
     loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000   d
 datatable.function(dat)   8.517600   9.28085  10.644163   9.97835  10.766749 215.3245 10000   b
  loop.function.TMS(dat)   4.482001   5.08030   5.916408   5.38215   5.833699  77.1935 10000   a

The most interesting result for me was probably the performance of tidyverse.function() on the real data. I will have to try adding the Rcpp solutions at a later date, as I could not get them to work on the real data.

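I appreciate all the interest in this post and the answers to it. My intention was to learn and to improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!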

Answer 1

Wal*_*ldi 10

A data.table solution:

library(data.table)
setDT(df)
setkey(df,Shop)
dcast(df[df,on=.(Shop=Shop),allow.cartesian=T][
           ,.(cnt=sum(i.Order<Order&i.Fruit!=Fruit)),by=.(Fruit,i.Fruit)]
      ,Fruit~i.Fruit,value.var='cnt')

    Fruit apple orange pear
1:  apple     0      0    2
2: orange     2      0    1
3:   pear     1      2    0

The key on Shop isn't necessary for this example, but will probably improve performance on a larger dataset.
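The core of the approach above is a self-join on Shop, which pairs every fruit with every other fruit picked in the same shop (a cartesian product within each shop). As a rough illustration of that idea only, here is a base R analogue using merge() and table(). It is not the data.table code itself, and the names joined, after and res are assumptions made for this sketch.

```r
# Example data from the question
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)
df$Fruit <- as.character(df$Fruit)

# Self-join on Shop: one row per (fruit m, fruit n) pair within each shop
joined <- merge(df, df, by = "Shop", suffixes = c(".m", ".n"))

# Keep only the pairs where fruit m was picked after fruit n
after <- joined[joined$Order.m > joined$Order.n, ]

# Cross-tabulate; fixing the factor levels keeps fruits with zero counts
fruits <- sort(unique(df$Fruit))
res <- table(factor(after$Fruit.m, levels = fruits),
             factor(after$Fruit.n, levels = fruits))
res
```

The joined table has one row per within-shop pair (35 rows for the example data), which is why this style of solution can grow quickly when shops contain many fruits.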

As the question attracted many comments on performance, I decided to check what Rcpp could bring:

library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {

std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n  = FruitLevels.length();

std::string currentShop = "";
int order, fruit, i, f;

NumericMatrix result(n,n);
NumericVector fruitOrder(n);

for (i=0;i<Fruit.length();i++){
    if (currentShop != Shop[i]) {
       //Init counter for each shop
       currentShop = Shop[i];
       std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
    }
    order = Order[i];
    fruit = FruitInt[i];
    fruitOrder[fruit-1] = order;
    for (f=0;f<n;f++) {
       // count fruits already seen in this shop with a smaller, non-zero Order
       if (order > fruitOrder[f] && fruitOrder[f] > 0) {
         result(fruit-1,f) = result(fruit-1,f)+1;
       }
    }
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')

rcppPair(df)

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

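On the example dataset, this runs more than 500 times faster than the data.table solution, probably because it avoids the cartesian product. It is not meant to be robust to bad input, and it expects the shops and orders to be sorted in ascending order.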
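Because the function relies on rows arriving grouped by shop and in ascending pick order, it may be worth checking (and if needed sorting) the data first. A minimal sketch in plain R, assuming the example df; is_sorted_by_shop_order is a hypothetical helper, not part of Rcpp or of the code above.

```r
# Example data from the question
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Hypothetical helper: TRUE if shops are in ascending order and,
# within every shop, Order is non-decreasing
is_sorted_by_shop_order <- function(d) {
  shop <- as.character(d$Shop)
  !is.unsorted(shop) && !any(tapply(d$Order, shop, is.unsorted))
}

# A copy with the rows of Shop A out of order
shuffled <- df[c(3, 1, 2, 4:13), ]
is_sorted_by_shop_order(shuffled)

# Sort by Shop, then by Order, before handing the data to the compiled code
sorted <- shuffled[order(shuffled$Shop, shuffled$Order), ]
is_sorted_by_shop_order(sorted)
```

Sorting costs little compared with a silently wrong count, so running such a check before the compiled function seems a reasonable precaution on real data.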

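Considering the few minutes it took to find the 3-line data.table solution, compared with the longer writing and debugging process for the Rcpp solution, I wouldn't recommend going for Rcpp here unless there is a real performance bottleneck.

It is still worth remembering that if performance is a must, Rcpp may be worth the effort.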

Answer 2

Col*_*ole 7

Here is an approach that makes a few simple modifications to get it about 5 times faster.

loop.function2 <- function(df){

    spl_df = split(df[, c(1L, 3L)], df[[2L]])
    
    mat <- array(0L,
                 dim=c(length(spl_df), length(spl_df)),
                 dimnames = list(names(spl_df), names(spl_df)))
    
    for (m in 1:(length(spl_df) - 1L)) {
        xm = spl_df[[m]]
        mShop = xm$Shop
        for (n in ((1+m):length(spl_df))) {
            xn = spl_df[[n]]
            mm = match(mShop, xn$Shop)
            inds = which(!is.na(mm))
            mOrder = xm[inds, "Order"]
            nOrder = xn[mm[inds], "Order"]

            mat[m, n] <- sum(nOrder < mOrder)
            mat[n, m] <- sum(mOrder < nOrder)
        }
    }
    mat
}

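There are 3 main concepts:

1. The original df[df$Fruit == fruits[m], ] lines were inefficient because the same comparison is made length(fruits)^2 times. Instead, we can use split(), which means we scan Fruit only once.
2. There was a lot of use of df$var, which extracts the vector on every iteration. Here, we place the assignment of xm outside the inner loop and try to minimize what we need to subset or extract.
3. I changed it to be closer to combn, because we can reuse the match() result by computing both sum(nOrder < mOrder) and, with the comparison switched, sum(mOrder < nOrder).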
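The first and third points can be seen in isolation in a short sketch, assuming the example df from the question. The names spl, xm, xn and counts are illustrative; the logic mirrors the body of the inner loop for one pair of fruits.

```r
# Example data from the question
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# One pass over Fruit: a list with one small data frame per fruit
spl <- split(df[, c("Shop", "Order")], df$Fruit)
names(spl)

# Each unordered pair is visited once, as combn() would enumerate them
combn(names(spl), 2)

# For one pair, a single match() serves both directions
xm <- spl[["apple"]]
xn <- spl[["pear"]]
mm <- match(xm$Shop, xn$Shop)   # position of each apple shop among pear shops
inds <- which(!is.na(mm))       # shops where both fruits were picked
mOrder <- xm$Order[inds]
nOrder <- xn$Order[mm[inds]]

counts <- c(apple_after_pear = sum(nOrder < mOrder),
            pear_after_apple = sum(mOrder < nOrder))
counts
```

The two sums correspond to mat["apple", "pear"] and mat["pear", "apple"] in the expected output, so each pair of fruits needs only one match() call instead of two.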

Performance:

bench::mark(loop.function(df), loop.function2(df))

# A tibble: 2 x 13
##  expression              min median
##  <bch:expr>         <bch:tm> <bch:>
##1 loop.function(df)    3.57ms 4.34ms
##2 loop.function2(df)  677.2us 858.6us

My hunch is that for a larger dataset, @Waldi's data.table solution will be faster. For smaller datasets, though, this should perform quite well.

Finally, here is another Rcpp approach, which seems to be slower than @Waldi's:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
    int x_size = x.size();
    IntegerMatrix ans(x_size, x_size);
    
    for (int m = 0; m < x_size - 1; m++) {
        DataFrame xm = x[m];
        CharacterVector mShop = xm[0];
        IntegerVector mOrder = xm[1];
        int nrows = mShop.size();
        for (int n = m + 1; n < x_size; n++) {
            DataFrame xn = x[n];
            CharacterVector nShop = xn[0];
            IntegerVector nOrder = xn[1];
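            int nrows_n = nShop.size();  // xn may have a different number of rows than xm
            for (int i = 0; i < nrows; i++) {
                for (int j = 0; j < nrows_n; j++) {
                    if (mShop[i] == nShop[j]) {
                        // ties (same Order) are not counted in either direction
                        if (mOrder[i] > nOrder[j])
                            ans(m, n)++;
                        else if (mOrder[i] < nOrder[j])
                            ans(n, m)++;
                        break;
                    }
                }
            }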
        }
    }
    return(ans);
}

Answer 3

TMS*_*TMS 5

It doesn't seem possible to vectorize over the original data frame df. But if you transform it with reshape2::dcast() to have one row per shop:

require(reshape2)

df$Fruit <- as.character(df$Fruit)

by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")

#   Shop apple orange pear
# 1    A     1      2    3
# 2    B    NA      1    2
# 3    C     2     NA    1
# 4    D     2      3    1
# 5    E     1      1    1

..., then you can at least easily vectorize over each combination of [m, n]:

fruits <- unique(df$Fruit)
outer(fruits, fruits, 
    Vectorize(
        function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), 
        c("m", "n")
    ), 
    by_shop)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0

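This is probably the outer solution you were looking for. A much faster solution would be a true vectorization over all [m, n] combinations of fruits, but I have thought about it and cannot see a way to do it. So I had to use the Vectorize function, which is of course much slower than true vectorization.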
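One way to push the vectorization a little further is sketched here in base R, under the assumption that each fruit appears at most once per shop. It builds the Shop x Fruit matrix with tapply() and compares one fruit column against all columns at once with colSums(). This is an illustrative alternative, not the code above. It still iterates over fruits (via vapply()), so only the comparisons over n and over shops are vectorized. The names wide and res are introduced for this sketch.

```r
# Example data from the question
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Shop x Fruit matrix of pick orders; NA where a fruit was not picked
wide <- tapply(df$Order, list(df$Shop, as.character(df$Fruit)), min)

# For each fruit m, compare its column against every column at once:
# colSums() counts the shops where m was picked after each fruit n
res <- t(vapply(colnames(wide),
                function(m) colSums(wide[, m] > wide, na.rm = TRUE),
                numeric(ncol(wide))))
dimnames(res) <- list(colnames(wide), colnames(wide))
res
```

For the example data this reproduces the matrix from the question, with fruits in alphabetical order. NA comparisons (a fruit missing from a shop) and ties are dropped by na.rm = TRUE and the strict inequality respectively.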

Benchmark comparison with your original function:

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
     loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923   100
 loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813   100

Function and benchmark code (with preservation of dimnames added):

require(reshape2)   
loop.function.TMS <- function(df) { 
    df$Fruit <- as.character(df$Fruit)
    by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
    fruits <- unique(df$Fruit)
    o <- outer(fruits, fruits, Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), c("m", "n")), by_shop)
    colnames(o) <- rownames(o) <- fruits
    o
}

require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))

Asked:	5 years, 11 months ago
Viewed:	620 times
Last active:	5 years, 10 months ago