从R中的一列中选择共享另一列中的值的最大数量

mfe*_*ira 19 algorithm performance r igraph dataframe

我有一个大型数据集,其中包含 40 年来不定期采样的站点。我想选择共享的最大站点数,让\xe2\x80\x99s 说,至少 5 年的数据。

\n

任何指示将不胜感激。

\n

Here\xe2\x80\x99s 是一个示例数据集:

\n
library(tidyverse)\n\nset.seed(123)\n\nDF <- tibble(\n  Sites = 1:100,\n  NYears = rbinom(100, 40, .2)\n  ) %>%\n  rowwise() %>%\n  mutate(Years = list(sample(1982:2021, NYears))) %>%\n  unnest(Years) %>%\n  select(-NYears)\n
Run Code Online (Sandbox Code Playgroud)\n

Mar*_*ark 10

这是一个 na\xc3\xafve 解决方案,虽然它可能不适用于较大的数据集,但可能是寻找更好解决方案的良好起点:

\n
library(tidyverse)\n\nset.seed(123)\n\n# I create the dataset myself, because I don't want it to be unnested\nDF <- tibble(\n  Sites = 1:100,\n  NYears =rbinom(100, 40, .2)\n  ) %>%\n  rowwise() %>%\n  mutate(Years = list(sort(sample(1982:2021, NYears)))) # sorting the years is good for later when I want to find the combinations, I can be sure that they will be in the same order\n\n# basically, we're doing a crossjoin, filtering to overlaps larger than 5, then generating all possible combinations of those overlaps\noverlaps <- cross_join(DF, DF) %>%\n  filter(Sites.x < Sites.y) %>%\n  mutate(Overlap = list(intersect(Years.x, Years.y))) %>%\n  filter(length(Overlap) >= 5) %>%\n  mutate(combinations = list(combn(Overlap, 5, simplify = FALSE))) %>%\n  select(combinations, Sites.x, Sites.y) %>% \n  unnest(combinations)\n\nmost_common_fives <- overlaps %>%\n  count(combinations) %>%\n  slice_max(n) %>%\n  pull(combinations)\n\noverlaps %>%\n    filter(combinations %in% most_common_fives) %>%\n    group_by(combinations) %>%\n    summarise(values = (list(unique(c(Sites.x, c(Sites.y)))))) %>%\n    pull(combinations, values) \n\n$`c(26, 53, 84)`\n[1] 1989 1991 1998 2001 2011\n\n$`c(31, 59, 67)`\n[1] 1989 1992 1999 2002 2005\n
Run Code Online (Sandbox Code Playgroud)\n


Tho*_*ing 10

免责声明

下面的方法不如@jblood94的解决方案有效(因此,如果您追求速度,请不要将我的解决方案用于大型数据集),而只是通过以图论的方式思考来改变思维方式并探索使用igraph来解决问题的可能性


简要想法(图论视角)

总的来说,我认为这个问题可以用图论的方式来处理,用 来解决igraph。如果你追求效率,你可能需要探索隐藏在图表背后的潜在属性。例如:

  • 共享数量Years可以解释为与两个顶点关联的边权重Sites
  • <=4此外,由于在搜索派系时可以跳过一些具有权重的边,因此可以进一步简化图。修剪网络并随后搜索应该比迭代所有可能的组合更有效。

如果您对详细信息感兴趣,请参阅后续答案和代码细分


一种igraph方法

下面可能是igraph解决该问题的一种选择(有关详细信息,请参阅代码注释):您可以尝试graph_from_adjacency_matrixSites使用 找到派系cliques(),例如,

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
    graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
    ) %>%
    # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
    subgraph.edges(E(.)[E(.)$weight > 4]) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))
Run Code Online (Sandbox Code Playgroud)

或替代方案

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    `>=`(5) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))
Run Code Online (Sandbox Code Playgroud)

这使

> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67

[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84
Run Code Online (Sandbox Code Playgroud)

如果您想进一步显示共享年份,您可以在 之上采取其他操作res,例如,

lapply(
    res,
    \(q) {
        list(
            sites = as.integer(names(q)),
            sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
        )
    }
)
Run Code Online (Sandbox Code Playgroud)

这使

[[1]]
[[1]]$sites
[1] 31 59 67

[[1]]$shared_years
[1] 1989 1992 1999 2002 2005


[[2]]
[[2]]$sites
[1] 26 53 84

[[2]]$shared_years
[1] 1989 1991 1998 2001 2011
Run Code Online (Sandbox Code Playgroud)

讨论

igraph上面的选项中,cliques()将是性能瓶颈,特别是当条件是“共享的数量Years应该是>=k”时,对于小ks,例如k=1k=2cliques()在这些情况下, before可以枚举更多的派系Filter()。您可以参考@jblood94的基准测试结果。


jbl*_*d94 8

这是一种从满足标准的站点对开始的方法,然后迭代地将站点添加到每个组中。它非常高效,几乎可以立即解决 OP 数据集上的 5 个共享年问题。

几乎可以肯定,这种方法的效率还有提高的空间,但这应该为您提供一个良好的起点。

作为函数实现:

library(Matrix)
library(data.table)
library(Rfast) # for rowSort()

f <- function(df, shared) {
  if (shared == 1) {
    # special case
    # get the yeear(s) with the maximum number of sites
    dt <- setorder(setDT(df), Sites, Years)[
      Years %in% df[,.(.N), Years][N == max(N)][[1]],
      setDT(setNames(as.list(Sites), paste0("Site", 1:.N))), Years
    ]
    # combine years with the same set of sites
    setcolorder(
      dt[
        ,.(nShared = .N, Years = .(Years)), eval(names(dt)[-1])
      ], "nShared"
    )
  } else if (shared == 2) {
    # special case
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along rows
      match(df$Years, u1 <- sort(unique(df$Years))), # years along columns
      x = 1L
    )
    # shared sites between year pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    # return an empty data.table if no pairs meet the criterion
    if (!length(m2@x)) return(
      data.table(nShared = integer(0), Site1 = integer(0), Years = list())
    )
    # find the year pairs that share the maximum number of sites
    i <- which(m2@x == (mx <- max(m2@x)))
    setcolorder(
      # initialize the output
      as.data.table(
        matrix(
          u2[(m[,m2@i[i] + 1L, drop = FALSE]*m[,m2@j[i] + 1L, drop = FALSE])@i + 1L],
          length(i), mx, 1, list(NULL, paste0("Site", 1:mx))
        )
      )[
        , Years := as.list(
          as.data.frame(
            matrix(u1[c(m2@i[i] + 1L, m2@j[i] + 1L)], 2, byrow = TRUE)
          )
        )
        # combine year pairs with the same set of sites
      ][,.(Years = .(unique(unlist(Years)))), eval(paste0("Site", 1:mx))][
        ,nShared := lengths(Years)
      ], "nShared"
    )
  } else {
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Years, u1 <- sort(unique(df$Years))), # years along rows
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along columns
      x = 1L
    )
    # shared years between site pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    i <- m2@x >= shared # index of pairings with at least "shared" years shared
    # initialize the output
    dt <- data.table(nShared = m2@x[i], Site1 = m2@i[i] + 1L, Site2 = m2@j[i] + 1L)
    # return an empty data.table if no pairs meet the criterion
    if (!nrow(dt)) return(dt[,Years := list()])
    n <- 2L # current number of sites all sharing "shared" number of years
    
    while (1) {
      # find additional sites to that can be added to each group
      m2 <- as(crossprod(Reduce("*", lapply(2:(n + 1L), \(i) m[,dt[[i]], drop = FALSE])), m), "TsparseMatrix")
      # don't add the same site twice!
      m2[cbind(rep(1:nrow(dt), n), unlist(dt[,2:(n + 1L)]))] <- 0
      i <- m2@x >= shared
      
      if (any(i)) { # an additional site can be added to at least one group
        n <- n + 1L
        # update the outupt
        dt <- unique(dt[m2@i[i] + 1L][,paste0("Site", n) := m2@j[i] + 1L][
          ,nShared := m2@x[i]
        ][
          ,paste0("Site", 1:n) := as.data.frame(rowSort(as.matrix(.SD))),
          .SDcols = 2:(n + 1L)
        ])
      } else break
    }
    
    # convert site indices back to the original values and record the common
    # years for each group
    dt[
      ,Years := .(.(u1[as.logical(Reduce("*", lapply(.SD, \(i) m[,i])))])),
      1:nrow(dt), .SDcols = 2:(n + 1L)
    ][
      ,paste0("Site", 1:n) := as.data.frame(matrix(u2[unlist(.SD)], .N)),
      .SDcols = 2:(n + 1L)
    ]
  }
}
Run Code Online (Sandbox Code Playgroud)

测试:

f(DF, 1)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13 Site14 Site15 Site16 Site17 Site18 Site19 Site20 Site21 Site22 Site23 Site24 Site25 Site26 Site27 Site28 Site29 Site30 Site31 Site32 Years
#> 1:       1     1     2     4     5    11    12    13    14    22     38     41     42     46     47     48     54     58     59     62     65     66     69     71     72     75     80     87     88     90     92     93    100  2004
f(DF, 2)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13     Years
#> 1:       2     3    11    26    31    37    38    49    53    55     64     65     68     84 1989,1991
f(DF, 3)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6          Years
#> 1:       3    24    48    61    87    95    96 1984,2017,2018
#> 2:       3    26    31    37    53    65    84 1989,1991,2001
#> 3:       3    26    31    53    55    64    84 1989,1991,1998
f(DF, 4)[]
#>    nShared Site1 Site2 Site3 Site4               Years
#> 1:       4    24    48    87    95 1984,2014,2017,2018
#> 2:       4    26    31    53    84 1989,1991,1998,2001
#> 3:       4    26    37    53    84 1989,1991,2001,2011
f(DF, 5)[]
#>    nShared Site1 Site2 Site3                    Years
#> 1:       5    26    53    84 1989,1991,1998,2001,2011
#> 2:       5    31    59    67 1989,1992,1999,2002,2005
f(DF, 6)[]
#>     nShared Site1 Site2                             Years
#>  1:       6     5    13     1988,2004,2007,2010,2013,2021
#>  2:       6    11    13     1988,2004,2006,2007,2010,2020
#>  3:       7    26    31 1989,1991,1998,1999,2001,2017,...
#>  4:       6     5    32     1986,1987,1991,2007,2010,2013
#>  5:       6    31    59     1989,1992,1993,1999,2002,2005
#>  6:       6    31    67     1989,1990,1992,1999,2002,2005
#>  7:       6    59    67     1985,1989,1992,1999,2002,2005
#>  8:       6    31    68     1989,1991,1992,2002,2009,2021
#>  9:       6    10    69     1983,1987,1989,1994,1995,2013
#> 10:       6     4    84     1998,2001,2003,2005,2011,2018
#> 11:       6    31    84     1989,1990,1991,1998,2001,2005
#> 12:       6    20    87     1994,2001,2010,2014,2015,2021
#> 13:       7    24    87 1984,2002,2006,2014,2015,2017,...
#> 14:       7    58    87 1986,2004,2005,2006,2014,2017,...
#> 15:       6    68    87     1984,1994,2002,2014,2015,2021
#> 16:       6    72    87     1986,2001,2002,2004,2010,2017
#> 17:       6    31    88     1989,1992,1993,2005,2009,2021
#> 18:       6    33    88     2005,2008,2009,2011,2013,2021
#> 19:       6    82    88     1989,1997,2005,2011,2013,2021
#> 20:       6     8    89     1982,2000,2003,2006,2011,2017
#> 21:       6    24    89     1984,1996,2000,2006,2011,2017
#> 22:       6     5    92     1987,1991,1997,2004,2007,2013
#> 23:       6    21    94     1984,1985,1987,1993,1999,2020
#> 24:       6     8    97     1992,1997,2003,2006,2011,2020
#>     nShared Site1 Site2                             Years
f(DF, 8)[]
#> Empty data.table (0 rows and 4 cols): nShared,Site1,Site2,Years
Run Code Online (Sandbox Code Playgroud)

标杆管理

基准测试将将此方法与 @ThomasIsCoding 的igraph解决方案进行比较。然而,igraph当为最小共享年份数选择较小的值(OP 数据集为 1 或 2)时,速度会变得非常慢。这些被排除在基准之外。

igraph作为函数的解:

library(igraph)

f2 <- function(DF, shared) {
  lapply(
    DF %>%
      table() %>%
      tcrossprod() %>%
      # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
      graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
      ) %>%
      # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
      subgraph.edges(E(.)[E(.)$weight >= shared]) %>%
      # find all cliques
      cliques() %>%
      # double check if `Sites` in each clique meet the condition, using full info from `DF`
      Filter(
        \(q) {
          sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= shared
        }, .
      ) %>%
      # pick the clique that consists of the maximum number of `Sites`
      `[`(lengths(.) == max(lengths(.))),
    \(q) {
      list(
        sites = as.integer(names(q)),
        sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
      )
    }
  )
}
Run Code Online (Sandbox Code Playgroud)

定时:

# The following will run for several minutes without returning a
# result. They will be excluded from the benchmark.
# system.time(f2(DF, 1))
# system.time(f2(DF, 2))

microbenchmark::microbenchmark(
  matrix1 = f(DF, 1),
  # igraph1 = f2(DF, 1),
  matrix2 = f(DF, 2),
  # igraph2 = f2(DF, 2),
  matrix3 = f(DF, 3),
  igraph3 = f2(DF, 3),
  matrix4 = f(DF, 4),
  igraph4 = f2(DF, 4),
  matrix5 = f(DF, 5),
  igraph5 = f2(DF, 5),
  matrix6 = f(DF, 6),
  igraph6 = f2(DF, 6),
  matrix8 = f(DF, 8),
  igraph8 = f2(DF, 8),
  times = 10
)

#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  matrix1    4.6677    6.1859    6.61501    6.83370    7.0784    8.2880    10
#>  matrix2    2.4428    2.5075    2.84573    2.80790    3.1315    3.5438    10
#>  matrix3   35.9857   37.0604   43.86820   41.89480   49.0314   61.3006    10
#>  igraph3 5934.9483 6136.9761 6240.44407 6278.98705 6362.0348 6471.6932    10
#>  matrix4   11.7949   12.2703   13.64224   13.99900   14.4769   16.3989    10
#>  igraph4  184.7337  191.0718  210.06073  203.10085  218.6591  255.5529    10
#>  matrix5    6.3713    6.8209    7.23380    7.27330    7.6825    7.9681    10
#>  igraph5   26.8565   29.4616   32.98676   33.06670   35.5976   39.0243    10
#>  matrix6    6.3562    6.5861    7.14332    6.86025    7.6393    8.6200    10
#>  igraph6   10.9108   11.6408   14.39365   12.39585   14.8864   28.6385    10
#>  matrix8    1.1340    1.2327    1.33831    1.33730    1.4175    1.5043    10
#>  igraph8    2.9498    3.6359    3.95806    3.72040    4.3448    5.2509    10
Run Code Online (Sandbox Code Playgroud)

这种igraph方法始终较慢,有时甚至非常慢。


数据

set.seed(123)

(DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
) %>%
    rowwise() %>%
    mutate(Years = list(sample(1982:2021, NYears))) %>%
    unnest(Years) %>%
    select(-NYears))
Run Code Online (Sandbox Code Playgroud)


Rob*_*ken 5

这是我最初的暴力方法的(某种程度上)优化版本,它现在不会经历所有可能的年份组合,而是迭代地构建n +1 年内采样的站点列表。

yr.sites <- tapply(DF$Sites, DF$Years, identity, simplify=F)
n.yrs <- length(yr.sites)
# year indices of each combination are stored in an attribute 'yrs'
for (i in seq_len(n.yrs)) attr(yr.sites[[i]], 'yrs') <- i
  
add_year <- function(ycs) {
  ycs <- lapply(ycs, \(sites) {
    yrs.comb <- attr(sites, 'yrs')
    last.yr <- tail(yrs.comb, 1)
    if (last.yr < n.yrs) { 
      # find subsets of sites sampled also in another year (for all years 
      #   after the last year already considered)
      lapply(seq(last.yr+1, n.yrs), \(yr) {
        inter <- intersect(sites, yr.sites[[yr]])
        # keep only intersections containing two or more sites
        if (length(inter) > 1) {
          attr(inter, 'yrs') <- c(yrs.comb, yr)
          inter
        } else NULL
      })
    } else NULL
  })
  ycs <- unlist(ycs, recursive=F)
  ycs[lengths(ycs) > 0]
}

yrs.comb.sites <- yr.sites
for (i in 2:5) {
  yrs.comb.sites <- add_year(yrs.comb.sites)
}
Run Code Online (Sandbox Code Playgroud)

仍然需要大约一秒钟的时间来计算,因此其他答案可能更有效,但与暴力方法相比,显着减少了访问的组合数量。

获奖者是:

best <- unname(yrs.comb.sites[lengths(yrs.comb.sites)==max(lengths(yrs.comb.sites))])
for (i in seq_along(best)) {
  attr(best[[i]], 'yrs') <- as.numeric(names(yr.sites))[attr(best[[i]], 'yrs')]
}
best
# [[1]]
# [1] 26 53 84
# attr(,"yrs")
# [1] 1989 1991 1998 2001 2011
# 
# [[2]]
# [1] 31 59 67
# attr(,"yrs")
# [1] 1989 1992 1999 2002 2005

Run Code Online (Sandbox Code Playgroud)

原答案

这是一种非常慢的暴力方法,@Mark 和 @jblood94 都提供了更有效的解决方案。在我的机器上(老实说,不是一台强大的机器),计算大约需要 80 秒。

yr.site <- tapply(DF$Sites, DF$Years, identity, simplify=F)
year.comb.sites <- combn(yr.site, 5, Reduce, f=intersect, simplify=F)
max.groups <- year.comb.sites[lengths(year.comb.sites) == max(lengths(year.comb.sites))]
max.groups
# [[1]]
# [1] 26 53 84
# 
# [[2]]
# [1] 31 59 67
Run Code Online (Sandbox Code Playgroud)

代码很短,但是背后有大量的计算。yr.sites是存储每年采样站点的列表。该combn行生成长度为 5 的年份的所有组合(有 658008 个这样的组合),并为每个组合查找所有这些年份中采样的站点。最终有2组最多3个满足初始条件的站点。

我们可以通过以下方式确认这一点

sort(table(subset(DF, Sites %in% max.groups[[1]])$Years))
# 1984 1988 1990 1993 1994 1996 1999 2000 2003 2005 2010 2017 2018 2021 1989 1991 1998 2001 2011 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    3    3    3    3    3 
Run Code Online (Sandbox Code Playgroud)

这表明,例如第一组中的所有三个地点确实是在 1989 年、1991 年、1998 年、2001 年和 2011 年采样的。

正如我已经说过的,如果您想要请求超过 5 年的共享时间,此解决方案并不有效,并且无法很好地扩展。例如,在 6 年的情况下,组合数 ( choose(40, 6)) 将增加到 3838380 个,更高的年数也不是好消息。


Tho*_*ing 5

igraph解决方案的后续:分解代码

我不希望我的igraph答案太长而难以阅读,因此我创建了一个新答案来详细说明该解决方案的工作原理。


虚拟示例

为简单起见,我们可以从一个较小的虚拟示例开始,我们假设我们的目标是找到至少years的最大份额Sites3

set.seed(0)
DF <- data.frame(
    Sites = rep(1:10, each = 10),
    Years = sample(2000:2020, 100, replace = TRUE)
) %>%
    unique()
Run Code Online (Sandbox Code Playgroud)

图表中的可视化DF看起来可以通过

DF %>%
    graph_from_data_frame() %>%
    set_vertex_attr(name = "color", value = names(V(.)) %in% DF$Sites) %>%
    plot()
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述


脚步

1)。正如我们所看到的,该图可以被视为二分图Sites并且Years是两种不同类型的顶点),并且共享Years实际上可以投影到边。由于我们需要跟踪共享的数量Years,因此我们可以使用边缘属性"weight"来识别共享的计数YearsSites在这种情况下,在投影之前需要一个 的邻接矩阵table+tcrossprod,例如,

adjmat <- DF %>%
    table() %>%
    tcrossprod()
Run Code Online (Sandbox Code Playgroud)

这使

> adjmat
     Sites
Sites 1 2 3 4 5 6 7 8 9 10
   1  8 2 3 0 4 4 4 2 5  3
   2  2 7 5 2 3 1 2 3 2  4
   3  3 5 9 4 4 3 2 3 2  4
   4  0 2 4 6 3 2 1 2 1  3
   5  4 3 4 3 9 5 4 4 4  4
   6  4 1 3 2 5 7 2 3 2  3
   7  4 2 2 1 4 2 9 4 3  4
   8  2 3 3 2 4 3 4 9 2  5
   9  5 2 2 1 4 2 3 2 7  4
   10 3 4 4 3 4 3 4 5 4  9
Run Code Online (Sandbox Code Playgroud)

2)。正如目标中所述,我们想要找出拥有>=3共同年份的群体,这意味着"weight"边数至少应为3。在步骤1)得到的邻接矩阵之上adjmat,我们可以进一步应用过滤器(>=3)来简化矩阵,这相当于对网络进行剪枝,即

g <- adjmat %>%
    `>=`(3) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE)
Run Code Online (Sandbox Code Playgroud)

plot(g)显示了如下所示的项目 在此输入图像描述

3)。我们知道,Sites共享相同的Years应该产生一个完整的子图,即clique。因此,我们可以枚举图中的所有派系g,这是由函数 完成的cliques(),即

clq <- g %>%
    cliques(min = 2)
Run Code Online (Sandbox Code Playgroud)

其中min = 2指定每个派系的最小大小。如果您事先没有有关大小的信息,也可以不使用cliques()任何其他参数来使用。现在,clq是一个看起来像的列表

> clq 
[[1]]
+ 2/10 vertices, named, from 3b71441:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 5

[[3]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  5  10

[[4]]
+ 2/10 vertices, named, from 3b71441:
[1] 3  10

[[5]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 7

[[6]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  7  10

[[7]]
+ 2/10 vertices, named, from 3b71441:
[1] 7  10

[[8]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 8

[[9]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 8

[[10]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  8  10

[[11]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  8  10

[[12]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 8

[[13]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 8

[[14]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  8  10

[[15]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  8  10

[[16]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 8

[[17]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  8  10

[[18]]
+ 2/10 vertices, named, from 3b71441:
[1] 8  10

[[19]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 7

[[20]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 7

[[21]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  7  10

[[22]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  7  10

[[23]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 3

[[24]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 5

[[25]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  5  10

[[26]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  3  10

[[27]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 5

[[28]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  5  10

[[29]]
+ 2/10 vertices, named, from 3b71441:
[1] 1  10

[[30]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 4

[[31]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 5

[[32]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  4  5  10

[[33]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  4  10

[[34]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 5

[[35]]
+ 3/10 vertices, named, from 3b71441:
[1] 4  5  10

[[36]]
+ 2/10 vertices, named, from 3b71441:
[1] 4  10

[[37]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 9

[[38]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 9

[[39]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 9

[[40]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  5  7  9  10

[[41]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  7  9  10

[[42]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 9

[[43]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  9  10

[[44]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  9  10

[[45]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 9

[[46]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 9

[[47]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  9  10

[[48]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  9  10

[[49]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 9

[[50]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  9  10

[[51]]
+ 2/10 vertices, named, from 3b71441:
[1] 9  10

[[52]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 6

[[53]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 6

[[54]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 6

[[55]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  3  5  6  10

[[56]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  6  10

[[57]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 6

[[58]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  6  10

[[59]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  6  10

[[60]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 8

[[61]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 8

[[62]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 8

[[63]]
+ 5/10 vertices, named, from 3b71441:
[1] 3  5  6  8  10

[[64]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  6  8  10

[[65]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 8

[[66]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  6  8  10

[[67]]
+ 3/10 vertices, named, from 3b71441:
[1] 6  8  10

[[68]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 6

[[69]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 6

[[70]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  6  10

[[71]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  6  10

[[72]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 6

[[73]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  6  10

[[74]]
+ 2/10 vertices, named, from 3b71441:
[1] 6  10

[[75]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 8

[[76]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 8

[[77]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 8

[[78]]
+ 5/10 vertices, named, from 3b71441:
[1] 2  3  5  8  10

[[79]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  8  10

[[80]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 8

[[81]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  5  8  10

[[82]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  8  10

[[83]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 3

[[84]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 5

[[85]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  5  10

[[86]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  3  10

[[87]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 5

[[88]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  5  10

[[89]]
+ 2/10 vertices, named, from 3b71441:
[1] 2  10
Run Code Online (Sandbox Code Playgroud)

4). 需要注意的是,并不是所有的派系都满足>=3共享的要求Years,因为“权重”表示的是总共享的数量Years,而不是区分共享的数量Years。换句话说,2000, 2001, 2002有效2000, 2000, 2002无效,尽管后者的计数为3。因此,我们需要检查每个派系中共享的分布Years

例如,如果我们查看第三个派系,即 ,clq[[3]]并检查共享的分布Years

> q <- clq[[3]]

> table(subset(DF, Sites %in% names(q)))
     Years
Sites 2000 2001 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
   3     1    1    1    0    0    0    1    1    0    1    0    1    1    0
   5     0    1    0    1    1    1    0    1    0    1    1    1    0    0
   10    1    0    0    0    1    1    0    1    1    1    0    0    1    1
     Years
Sites 2018 2019 2020
   3     0    1    0
   5     1    0    0
   10    0    0    1
Run Code Online (Sandbox Code Playgroud)

我们看到所有1s 的列都是20092011,这意味着它们 只有 2 个共享Years,因此无效。要选择有效的派系,我们可以使用Filter过滤标准,例如,

validclq <- clq %>%
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 3
        }, .
    )
Run Code Online (Sandbox Code Playgroud)

我们将会看到

> validclq
[[1]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 5

[[3]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3  10

[[4]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 7

[[5]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7  10

[[6]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 8

[[7]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 8

[[8]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 8

[[9]]
+ 2/10 vertices, named, from 9bd9430:
[1] 8  10

[[10]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 7

[[11]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 3

[[12]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 5

[[13]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1  10

[[14]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 4

[[15]]
+ 3/10 vertices, named, from 9bd9430:
[1] 3  4  10

[[16]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 5

[[17]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4  10

[[18]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 9

[[19]]
+ 3/10 vertices, named, from 9bd9430:
[1] 1 5 9

[[20]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 9

[[21]]
+ 3/10 vertices, named, from 9bd9430:
[1] 7  9  10

[[22]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 9

[[23]]
+ 2/10 vertices, named, from 9bd9430:
[1] 9  10

[[24]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 6

[[25]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 8

[[26]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 6

[[27]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 6

[[28]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6  10

[[29]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 8

[[30]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 3

[[31]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 5

[[32]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2  10
Run Code Online (Sandbox Code Playgroud)

其中 89 个派系中有 32 个满足要求。

5)。作为最后一步,我们从有效的 cliques 中选择具有最大尺寸的 cliques(因为我们正在寻找 的最大数量。使用获取每个团的大小,我们可以过滤最大的团,例如,Sitesvalidclqlenghts

res <- validclq %>%
    `[`(lengths(.) == max(lengths(.)))
Run Code Online (Sandbox Code Playgroud)

我们终于得到了

[[1]]
+ 3/10 vertices, named, from 75803c6:
[1] 3  4  10

[[2]]
+ 3/10 vertices, named, from 75803c6:
[1] 1 5 9

[[3]]
+ 3/10 vertices, named, from 75803c6:
[1] 7  9  10
Run Code Online (Sandbox Code Playgroud)