任何更快的方法来检查列表中的列表是否等效？

Tho*_*ing 9 algorithm performance if-statement r list

这里我有1:7四个不同分区的整数，即 {1}、{2,3,4}、{5,6} 和 {7}，这些分区写在一个列表中，即list(1,c(2,3,4),c(5,6),7). 我将分区视为集合，这样一个分区内元素的不同排列应该被识别为相同的排列。例如，list(1,c(2,3,4),c(5,6),7)和list(7,1,c(2,3,4),c(6,5))是等价的。

请注意，列表中的元素没有重复，例如 no list(c(1,2),c(2,1),c(1,2))，因为这个问题是在讨论整个集合上的独占分区。

我在列表中列出了一些不同的排列lst，如下所示

lst <- list(list(1,c(2,3,4),c(5,6),7),
            list(c(2,3,4),1,7,c(5,6)),
            list(1,c(2,3,4),7,c(6,5)),
            list(7,1,c(3,2,4),c(5,6)))

Run Code Online (Sandbox Code Playgroud)

我想要做的是验证所有排列都是等效的。如果是，那么我们得到 result TRUE。

我所做的，到目前为止是每个分区中的元素进行排序，并使用setdiff()与interset()和union()判断它（见下面我的代码）

s <- Map(function(v) Map(sort,v),lst)
equivalent <- length(setdiff(Reduce(union,s),Reduce(intersect,s),))==0

Run Code Online (Sandbox Code Playgroud)

但是，我猜当分区大小增加时，这种方法会很慢。有没有更快的方法来制作它？提前欣赏！

一些测试用例（小数据）

# should return `TRUE`
lst1 <- list(list(1,c(2,3,4),c(5,6)),
            list(c(2,3,4),1,c(5,6)),
            list(1,c(2,3,4),c(6,5)))

# should return `TRUE`
lst2 <- list(list(1:2, 3:4), list(3:4, 1:2))

# should return `FALSE`
lst3 <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))

Run Code Online (Sandbox Code Playgroud)

如果没有以rcpp为特色的解决方案，则R有关fast和任何变体的帖子都不完整。

为了最大限度地提高效率，选择正确的数据结构将是最重要的。我们的数据结构需要存储唯一值并且还需要快速插入/访问。这正是std::unordered_set 所体现的。我们只需要确定如何唯一地识别每个vectorunordered integers。

输入算术基本定理

FTA 规定，每个数都可以由素数的乘积唯一地表示（直到因子的顺序）。

这是一个示例，演示我们如何使用 FTA 快速破译两个向量是否按顺序等效（注意P下面是素数列表... (2, 3, 5, 7, 11, etc.)：

                   Maps to                    Maps to              product
vec1 = (1, 2, 7)    -->>    P[1], P[2], P[7]   --->>   2,  3, 17     -->>   102
vec2 = (7, 3, 1)    -->>    P[7], P[3], P[1]   --->>  17,  5,  2     -->>   170
vec3 = (2, 7, 1)    -->>    P[2], P[7], P[1]   --->>   3, 17,  2     -->>   102

Run Code Online (Sandbox Code Playgroud)

由此，我们看到vec1并vec3正确映射到相同的数字，而vec2被映射到不同的值。

由于我们的实际向量可能包含多达一百个小于 1000 的整数，因此应用 FTA 将产生非常大的数字。我们可以利用对数乘积规则来解决这个问题：

log _b (xy) = log _b (x) + log _b (y)

有了这个，我们将能够处理更大的数字示例（这在极大的示例上开始恶化）。

首先，我们需要一个简单的素数生成器（注意，我们实际上是在生成每个素数的对数）。

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

void getNPrimes(std::vector<double> &logPrimes) {
    
    const int n = logPrimes.size();
    const int limit = static_cast<int>(2.0 * static_cast<double>(n) * std::log(n));
    std::vector<bool> sieve(limit + 1, true);
    
    int lastP = 3;
    const int fsqr = std::sqrt(static_cast<double>(limit));
    
    while (lastP <= fsqr) {
        for (int j = lastP * lastP; j <= limit; j += 2 * lastP)
            sieve[j] = false;
        
        int ind = 2;
        
        for (int k = lastP + 2; !sieve[k]; k += 2)
            ind += 2;
        
        lastP += ind;
    }
    
    logPrimes[0] = std::log(2.0);
    
    for (int i = 3, j = 1; i <= limit && j < n; i += 2)
        if (sieve[i])
            logPrimes[j++] = std::log(static_cast<double>(i));
}

Run Code Online (Sandbox Code Playgroud)

这是主要的实现：

// [[Rcpp::export]]
bool f_Rcpp_Hash(List x) {
    
    List tempLst = x[0];
    const int n = tempLst.length();
    int myMax = 0;

    // Find the max so we know how many primes to generate
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        const int tempMax = *std::max_element(v.cbegin(), v.cend());
        
        if (tempMax > myMax)
            myMax = tempMax;
    }
    
    std::vector<double> logPrimes(myMax + 1, 0.0);
    getNPrimes(logPrimes);
    double sumMax = 0.0;
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        if (mySum > sumMax)
            sumMax = mySum;
    }

    // Since all of the sums will be double values and we want to
    // ensure that they are compared with scrutiny, we multiply
    // each sum by a very large integer to bring the decimals to
    // the right of the zero and then convert them to an integer.
    // E.g. Using the example above v1 = (1, 2, 7) & v2 = (7, 3, 1)
    //              
    //    sum of log of primes for v1 = log(2) + log(3) + log(17)
    //                               ~= 4.62497281328427
    //
    //    sum of log of primes for v2 = log(17) + log(5) + log(2)
    //                               ~= 5.13579843705026
    //    
    //    multiplier = floor(.Machine$integer.max / 5.13579843705026)
    //    [1] 418140173
    //    
    // Now, we multiply each sum and convert to an integer
    //    
    //    as.integer(4.62497281328427 * 418140173)
    //    [1] 1933886932    <<--   This is the key for v1
    //
    //    as.integer(5.13579843705026 * 418140173)
    //    [1] 2147483646    <<--   This is the key for v2
    
    const uint64_t multiplier = std::numeric_limits<int>::max() / sumMax;
    std::unordered_set<uint64_t> canon;
    canon.reserve(n);
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        canon.insert(static_cast<uint64_t>(multiplier * mySum));
    }
    
    const auto myEnd = canon.end();
    
    for (auto it = x.begin() + 1; it != x.end(); ++it) {
        List tempLst = *it;
        
        if (tempLst.length() != n)
            return false;

        for (int j = 0; j < n; ++j) {
            IntegerVector v = tempLst[j];
            double mySum = 0.0;
            
            for (auto k: v)
                mySum += logPrimes[k];
            
            const uint64_t key = static_cast<uint64_t>(multiplier * mySum);
            
            if (canon.find(key) == myEnd)
                return false;
        }
    }
    
    return true;
}

Run Code Online (Sandbox Code Playgroud)

以下是lst1, lst2, lst3, & lst (the large one)@GKi应用时的结果。

f_Rcpp_Hash(lst)
[1] TRUE

f_Rcpp_Hash(lst1)
[1] TRUE

f_Rcpp_Hash(lst2)
[1] FALSE

f_Rcpp_Hash(lst3)
[1] FALSE

Run Code Online (Sandbox Code Playgroud)

这里有一些units参数设置为的基准relative。

microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst3)
               , f_chinsoon12(lst3)
               , f_GKi_6a(lst3)
               , f_GKi_6b(lst3)
               , f_Rcpp_Hash(lst3))
Unit: relative
                 expr       min        lq      mean    median        uq       max neval
f_ThomsIsCoding(lst3) 84.882393 63.541468 55.741646 57.894564 56.732118 33.142979    10
   f_chinsoon12(lst3) 31.984571 24.320220 22.148787 22.393368 23.599284 15.211029    10
       f_GKi_6a(lst3)  7.207269  5.978577  5.431342  5.761809  5.852944  3.439283    10
       f_GKi_6b(lst3)  7.399280  5.751190  6.350720  5.484894  5.893290  8.035091    10
    f_Rcpp_Hash(lst3)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    10


microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst)
               , f_chinsoon12(lst)
               , f_GKi_6a(lst)
               , f_GKi_6b(lst)
               , f_Rcpp_Hash(lst))
Unit: relative
                expr        min         lq       mean     median        uq       max neval
f_ThomsIsCoding(lst) 199.776328 202.318938 142.909407 209.422530 91.753335 85.090838    10
   f_chinsoon12(lst)   9.542780   8.983248   6.755171   9.766027  4.903246  3.834358    10
       f_GKi_6a(lst)   3.169508   3.158366   2.555443   3.731292  1.902140  1.649982    10
       f_GKi_6b(lst)   2.992992   2.943981   2.019393   3.046393  1.315166  1.069585    10
    f_Rcpp_Hash(lst)   1.000000   1.000000   1.000000   1.000000  1.000000  1.000000    10

Run Code Online (Sandbox Code Playgroud)

在更大的例子中，比最快的解决方案快大约3 倍。

这是什么意思？

对我来说，这个结果充分说明了base R@GKi、@chinsoon12、@Gregor、@ThomasIsCoding等所展示的美感和效率。我们写了大约 100 行非常具体的代码C++以获得适度的加速。公平地说，base R解决方案最终会调用大部分已编译的代码，并最终像我们上面所做的那样使用哈希表。

非常感谢您的贡献！你的工作很棒！ (2认同)

排序后，您可以使用duplicated和all。

s <- lapply(lst, function(x) lapply(x, sort)) #Sort vectors
s <- lapply(s, function(x) x[order(vapply(x, "[", 1, 1))]) #Sort lists
all(duplicated(s)[-1]) #Test if there are all identical
#length(unique(s)) == 1 #Alternative way to test if all are identical

Run Code Online (Sandbox Code Playgroud)

替代方法：在一个循环中排序

s <- lapply(lst, function(x) {
  tt <- lapply(x, sort)
  tt[order(vapply(tt, "[", 1, 1))]
})
all(duplicated(s)[-1])

Run Code Online (Sandbox Code Playgroud)

替代方案：在循环期间排序并允许提前退出

s <- lapply(lst[[1]], sort)
s <- s[order(vapply(s, "[", 1, 1))]
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  x <- x[order(vapply(x, "[", 1, 1))]
  if(!identical(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

Run Code Online (Sandbox Code Playgroud)

或使用 setequal

s <- lapply(lst[[1]], sort)
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  if(!setequal(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

Run Code Online (Sandbox Code Playgroud)

或者稍微改进@chinsoon12的想法，用向量交换列表！

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  x <- rep(seq_along(x), lengths(x))[order(unlist(x))]
  if(!identical(s, x)) {tt <- FALSE; break;}
}
tt

Run Code Online (Sandbox Code Playgroud)

或避免第二个 order

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
y <- s
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  y <- y[0]
  y[unlist(x)] <- rep(seq_along(x), lengths(x))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

Run Code Online (Sandbox Code Playgroud)

或order与match（或fmatch）交换

x <- lst[[1]]
s <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
s <- match(s, unique(s))
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]]
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  y <- match(y, unique(y))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

Run Code Online (Sandbox Code Playgroud)

或者没有提前退出。

s <- lapply(lst, function(x) {
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  match(y, unique(y))
})
all(duplicated(s)[-1])

Run Code Online (Sandbox Code Playgroud)

或用 C++ 编写

sourceCpp(code = "#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
bool f_GKi_6_Rcpp(const List &x) {
  const List &x0 = x[0];
  const unsigned int n = x0.length();
  unsigned int nn = 0;
  for (List const &i : x0) {nn += i.length();}
  std::vector<int> s(nn);
  for (unsigned int i=0; i<n; ++i) {
    const IntegerVector &v = x0[i];
    for (int const &j : v) {
      if(j > nn) return false;
      s[j-1] = i;
    }
  }
  {
    std::vector<int> lup(n, -1);
    int j = 0;
    for(int &i : s) {
      if(lup[i] < 0) {lup[i] = j++;}
      i = lup[i];
    }
  }
  for (List const &i : x) {
    if(i.length() != n) return false;
    std::vector<int> sx(nn);
    for(unsigned int j=0; j<n; ++j) {
      const IntegerVector &v = i[j];
      for (int const &k : v) {
        if(k > nn) return false;
        sx[k-1] = j;
      }
    }
    {
      std::vector<int> lup(n, -1);
      int j = 0;
      for(int &i : sx) {
        int &lupp = lup[i];
        if(lupp == -1) {lupp = j; i = j++;
        } else {i = lupp;}
      }
    }
    if(s!=sx) return false;
  }
  return true;
}
")

Run Code Online (Sandbox Code Playgroud)

感谢@Gregor 提供改进答案的提示！

@Gregor 感谢您提供按“min”排序的提示！ (2认同)

归档时间：	6 年，2 月前
查看次数：	360 次
最近记录：	4 年，6 月前

脚本中的ggplot图不会显示在Rstudio中 66

如何从data.frame中删除列？ 38

如何在Shiny R App中为select.put获取从server.R到ui.R的选项向量 31

为体育联盟创造自然时间表 12

如何使用高阶函数得到对角线的总和？ 8

TensorFlow 中 CPU 到 GPU 的数据传输速度慢吗？ 8

有效地计算大型python列表中的项目 5

Bash脚本定期列出文件 2

如果在Matlab中的FOR循环内部不起作用 1

我不能使用&&和||创建if语句 0

如何检查字符串是否包含JavaScript中的子字符串？ 7430

如何在ActionScript 3中将"Null"(真正的姓氏!)传递给SOAP Web服务？ 4603

如何将空目录添加到Git存储库？ 4039

显式关键字是什么意思？ 2753

.NET中的decimal,float和double之间的区别？ 2015

'real','user'和'sys'在time(1)的输出中意味着什么？ 1622

JavaScript发布请求,如表单提交 1465

如何删除导出的环境变量？ 1439

除了XHTML自包含标记之外,RegEx匹配开放标记 1323

什么是右值,左值,x值,glvalues和prvalues？ 1291