mfe*_*ira 19 algorithm performance r igraph dataframe
I have a large dataset of sites sampled irregularly over 40 years. I want to select the maximum number of sites that share, let's say, at least 5 years of data.
Any pointers would be greatly appreciated.
Here's an example dataset:
library(tidyverse)

set.seed(123)

DF <- tibble(
  Sites = 1:100,
  NYears = rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sample(1982:2021, NYears))) %>%
  unnest(Years) %>%
  select(-NYears)
Mar*_*ark 10
Here is a naïve solution; while it may not work for larger datasets, it could be a good starting point for finding a better one:
library(tidyverse)

set.seed(123)

# I create the dataset myself, because I don't want it to be unnested
DF <- tibble(
  Sites = 1:100,
  NYears = rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  # sorting the years is useful later when finding the combinations:
  # it guarantees they will always be in the same order
  mutate(Years = list(sort(sample(1982:2021, NYears))))

# basically, we do a cross join, filter to overlaps of at least 5 years,
# then generate all possible 5-year combinations of those overlaps
overlaps <- cross_join(DF, DF) %>%
  filter(Sites.x < Sites.y) %>%
  mutate(Overlap = list(intersect(Years.x, Years.y))) %>%
  filter(length(Overlap) >= 5) %>%
  mutate(combinations = list(combn(Overlap, 5, simplify = FALSE))) %>%
  select(combinations, Sites.x, Sites.y) %>%
  unnest(combinations)

most_common_fives <- overlaps %>%
  count(combinations) %>%
  slice_max(n) %>%
  pull(combinations)

overlaps %>%
  filter(combinations %in% most_common_fives) %>%
  group_by(combinations) %>%
  summarise(values = list(unique(c(Sites.x, Sites.y)))) %>%
  pull(combinations, values)

$`c(26, 53, 84)`
[1] 1989 1991 1998 2001 2011

$`c(31, 59, 67)`
[1] 1989 1992 1999 2002 2005
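A quick sanity check of that result (not part of the original answer; it uses the unnested DF from the question rather than the nested one built above): the years kept should be exactly those sampled at all three of sites 26, 53 and 84.

DF %>%
  filter(Sites %in% c(26, 53, 84)) %>%
  count(Years) %>%
  filter(n == 3) %>%   # years present at all three sites
  pull(Years)
# should return 1989 1991 1998 2001 2011 (cf. the outputs below)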
Tho*_*ing 10
The approach below is not as efficient as @jblood94's solution (so don't use mine on large datasets if speed is what you are after); it is rather meant to shift the mindset towards thinking about the problem in terms of graph theory and to explore the possibility of solving it with igraph.

In general, I think this problem can be treated in a graph-theoretical manner and solved with igraph. If you are after efficiency, you may need to exploit the properties hidden behind the graph. For example:

The shared Years can be interpreted as the weight of the edge joining two Sites vertices. Moreover, since edges with weight <= 4 can be skipped when searching for cliques, the graph can be simplified further. Pruning the network and searching afterwards should be more efficient than iterating over all possible combinations. If you are interested in the details, see the follow-up answer with the code breakdown.

igraph approach

Below might be one option for solving the problem with igraph (see the code comments for details): you can build a graph with graph_from_adjacency_matrix on the Sites adjacency matrix and find the cliques with cliques(), e.g.,
library(igraph)

res <- DF %>%
table() %>%
tcrossprod() %>%
# build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
graph_from_adjacency_matrix(
"undirected",
diag = FALSE,
weighted = TRUE
) %>%
# prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
subgraph.edges(E(.)[E(.)$weight > 4]) %>%
# find all cliques
cliques(min = 2) %>%
# double check if `Sites` in each clique meet the condition, using full info from `DF`
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
}, .
) %>%
# pick the clique that consists of the maximum number of `Sites`
`[`(lengths(.) == max(lengths(.)))
or, alternatively,
res <- DF %>%
table() %>%
tcrossprod() %>%
`>=`(5) %>%
graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
# find all cliques
cliques(min = 2) %>%
# double check if `Sites` in each clique meet the condition, using full info from `DF`
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
}, .
) %>%
# pick the clique that consists of the maximum number of `Sites`
`[`(lengths(.) == max(lengths(.)))
This gives
> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67
[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84
If you want to further display the shared years, you can run additional operations on top of res, e.g.,
lapply(
res,
\(q) {
list(
sites = as.integer(names(q)),
sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
)
}
)
which gives
[[1]]
[[1]]$sites
[1] 31 59 67
[[1]]$sharedYears
[1] 1989 1992 1999 2002 2005
[[2]]
[[2]]$sites
[1] 26 53 84
[[2]]$sharedYears
[1] 1989 1991 1998 2001 2011
In the igraph options above, cliques() will be the performance bottleneck, especially when the condition is that the number of shared Years should be >= k for small values of k, e.g., k = 1 or k = 2. In those cases, far more cliques get enumerated by cliques() before Filter() is applied. See @jblood94's benchmarking results for a comparison.
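To see the blow-up concretely, one can count the cliques enumerated before Filter() for a given threshold k (a sketch, not from the original answer; it assumes the question's DF and the igraph/tidyverse setup above):

n_cliques <- function(k) {
  DF %>%
    table() %>%
    tcrossprod() %>%
    `>=`(k) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
    cliques(min = 2) %>%
    length()
}
# n_cliques(5) is small, but for k = 1 or k = 2 the enumeration itself becomes
# infeasible, which is why those cases are excluded from @jblood94's benchmark.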
Here is an approach that starts with pairs of sites meeting the criterion and then iteratively adds sites to each group. It is very efficient, solving the 5-shared-years case on the OP's dataset almost instantly.
There is almost certainly room to improve the efficiency of this approach, but it should give you a good starting point.
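For intuition, here is a simplified, unoptimized sketch of the same pair-then-extend idea in plain R (not the actual implementation below; it assumes the unnested DF from the question and hard-codes the 5-year threshold):

site_years <- with(DF, split(Years, Sites))   # years sampled at each site
ids <- as.integer(names(site_years))
shares5 <- function(group) {
  length(Reduce(intersect, site_years[as.character(group)])) >= 5
}
# start from all pairs of sites sharing at least 5 years
groups <- Filter(shares5, combn(ids, 2, simplify = FALSE))
repeat {
  bigger <- unlist(
    lapply(groups, \(g) {
      candidates <- ids[ids > max(g)]           # only extend upward, so no duplicate groups
      Filter(shares5, lapply(candidates, \(s) c(g, s)))
    }),
    recursive = FALSE
  )
  if (length(bigger) == 0) break
  groups <- bigger                              # keep only the enlarged groups
}
groups   # the largest groups of sites sharing at least 5 common years

The actual function below does the same thing much faster by expressing the shared-years checks as sparse matrix cross-products (Matrix) and by managing the groups in a data.table.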
Implemented as a function:
library(Matrix)
library(data.table)
library(Rfast) # for rowSort()
f <- function(df, shared) {
if (shared == 1) {
# special case
    # get the year(s) with the maximum number of sites
dt <- setorder(setDT(df), Sites, Years)[
Years %in% df[,.(.N), Years][N == max(N)][[1]],
setDT(setNames(as.list(Sites), paste0("Site", 1:.N))), Years
]
# combine years with the same set of sites
setcolorder(
dt[
,.(nShared = .N, Years = .(Years)), eval(names(dt)[-1])
], "nShared"
)
} else if (shared == 2) {
# special case
# create a sparse matrix with year-site pairs
m <- sparseMatrix(
match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along rows
match(df$Years, u1 <- sort(unique(df$Years))), # years along columns
x = 1L
)
# shared sites between year pairs
m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
# return an empty data.table if no pairs meet the criterion
if (!length(m2@x)) return(
data.table(nShared = integer(0), Site1 = integer(0), Years = list())
)
# find the year pairs that share the maximum number of sites
i <- which(m2@x == (mx <- max(m2@x)))
setcolorder(
# initialize the output
as.data.table(
matrix(
u2[(m[,m2@i[i] + 1L, drop = FALSE]*m[,m2@j[i] + 1L, drop = FALSE])@i + 1L],
length(i), mx, 1, list(NULL, paste0("Site", 1:mx))
)
)[
, Years := as.list(
as.data.frame(
matrix(u1[c(m2@i[i] + 1L, m2@j[i] + 1L)], 2, byrow = TRUE)
)
)
# combine year pairs with the same set of sites
][,.(Years = .(unique(unlist(Years)))), eval(paste0("Site", 1:mx))][
,nShared := lengths(Years)
], "nShared"
)
} else {
# create a sparse matrix with year-site pairs
m <- sparseMatrix(
match(df$Years, u1 <- sort(unique(df$Years))), # years along rows
match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along columns
x = 1L
)
# shared years between site pairs
m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
i <- m2@x >= shared # index of pairings with at least "shared" years shared
# initialize the output
dt <- data.table(nShared = m2@x[i], Site1 = m2@i[i] + 1L, Site2 = m2@j[i] + 1L)
# return an empty data.table if no pairs meet the criterion
if (!nrow(dt)) return(dt[,Years := list()])
n <- 2L # current number of sites all sharing "shared" number of years
while (1) {
      # find additional sites that can be added to each group
m2 <- as(crossprod(Reduce("*", lapply(2:(n + 1L), \(i) m[,dt[[i]], drop = FALSE])), m), "TsparseMatrix")
# don't add the same site twice!
m2[cbind(rep(1:nrow(dt), n), unlist(dt[,2:(n + 1L)]))] <- 0
i <- m2@x >= shared
if (any(i)) { # an additional site can be added to at least one group
n <- n + 1L
        # update the output
dt <- unique(dt[m2@i[i] + 1L][,paste0("Site", n) := m2@j[i] + 1L][
,nShared := m2@x[i]
][
,paste0("Site", 1:n) := as.data.frame(rowSort(as.matrix(.SD))),
.SDcols = 2:(n + 1L)
])
} else break
}
# convert site indices back to the original values and record the common
# years for each group
dt[
,Years := .(.(u1[as.logical(Reduce("*", lapply(.SD, \(i) m[,i])))])),
1:nrow(dt), .SDcols = 2:(n + 1L)
][
,paste0("Site", 1:n) := as.data.frame(matrix(u2[unlist(.SD)], .N)),
.SDcols = 2:(n + 1L)
]
}
}
Testing:
f(DF, 1)[]
#> nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13 Site14 Site15 Site16 Site17 Site18 Site19 Site20 Site21 Site22 Site23 Site24 Site25 Site26 Site27 Site28 Site29 Site30 Site31 Site32 Years
#> 1: 1 1 2 4 5 11 12 13 14 22 38 41 42 46 47 48 54 58 59 62 65 66 69 71 72 75 80 87 88 90 92 93 100 2004
f(DF, 2)[]
#> nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13 Years
#> 1: 2 3 11 26 31 37 38 49 53 55 64 65 68 84 1989,1991
f(DF, 3)[]
#> nShared Site1 Site2 Site3 Site4 Site5 Site6 Years
#> 1: 3 24 48 61 87 95 96 1984,2017,2018
#> 2: 3 26 31 37 53 65 84 1989,1991,2001
#> 3: 3 26 31 53 55 64 84 1989,1991,1998
f(DF, 4)[]
#> nShared Site1 Site2 Site3 Site4 Years
#> 1: 4 24 48 87 95 1984,2014,2017,2018
#> 2: 4 26 31 53 84 1989,1991,1998,2001
#> 3: 4 26 37 53 84 1989,1991,2001,2011
f(DF, 5)[]
#> nShared Site1 Site2 Site3 Years
#> 1: 5 26 53 84 1989,1991,1998,2001,2011
#> 2: 5 31 59 67 1989,1992,1999,2002,2005
f(DF, 6)[]
#> nShared Site1 Site2 Years
#> 1: 6 5 13 1988,2004,2007,2010,2013,2021
#> 2: 6 11 13 1988,2004,2006,2007,2010,2020
#> 3: 7 26 31 1989,1991,1998,1999,2001,2017,...
#> 4: 6 5 32 1986,1987,1991,2007,2010,2013
#> 5: 6 31 59 1989,1992,1993,1999,2002,2005
#> 6: 6 31 67 1989,1990,1992,1999,2002,2005
#> 7: 6 59 67 1985,1989,1992,1999,2002,2005
#> 8: 6 31 68 1989,1991,1992,2002,2009,2021
#> 9: 6 10 69 1983,1987,1989,1994,1995,2013
#> 10: 6 4 84 1998,2001,2003,2005,2011,2018
#> 11: 6 31 84 1989,1990,1991,1998,2001,2005
#> 12: 6 20 87 1994,2001,2010,2014,2015,2021
#> 13: 7 24 87 1984,2002,2006,2014,2015,2017,...
#> 14: 7 58 87 1986,2004,2005,2006,2014,2017,...
#> 15: 6 68 87 1984,1994,2002,2014,2015,2021
#> 16: 6 72 87 1986,2001,2002,2004,2010,2017
#> 17: 6 31 88 1989,1992,1993,2005,2009,2021
#> 18: 6 33 88 2005,2008,2009,2011,2013,2021
#> 19: 6 82 88 1989,1997,2005,2011,2013,2021
#> 20: 6 8 89 1982,2000,2003,2006,2011,2017
#> 21: 6 24 89 1984,1996,2000,2006,2011,2017
#> 22: 6 5 92 1987,1991,1997,2004,2007,2013
#> 23: 6 21 94 1984,1985,1987,1993,1999,2020
#> 24: 6 8 97 1992,1997,2003,2006,2011,2020
#> nShared Site1 Site2 Years
f(DF, 8)[]
#> Empty data.table (0 rows and 4 cols): nShared,Site1,Site2,Years
The benchmark below compares this approach to @ThomasIsCoding's igraph solution. However, igraph becomes extremely slow when a small value is chosen for the minimum number of shared years (1 or 2 for the OP's dataset), so those cases are excluded from the benchmark.
The igraph solution as a function:
library(igraph)
f2 <- function(DF, shared) {
lapply(
DF %>%
table() %>%
tcrossprod() %>%
# build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
graph_from_adjacency_matrix(
"undirected",
diag = FALSE,
weighted = TRUE
) %>%
      # prune the graph by keeping only the edges that meet the condition, i.e., weight >= shared
subgraph.edges(E(.)[E(.)$weight >= shared]) %>%
# find all cliques
cliques() %>%
# double check if `Sites` in each clique meet the condition, using full info from `DF`
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= shared
}, .
) %>%
# pick the clique that consists of the maximum number of `Sites`
`[`(lengths(.) == max(lengths(.))),
\(q) {
list(
sites = as.integer(names(q)),
sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
)
}
)
}
Timings:
# The following will run for several minutes without returning a
# result. They will be excluded from the benchmark.
# system.time(f2(DF, 1))
# system.time(f2(DF, 2))
microbenchmark::microbenchmark(
matrix1 = f(DF, 1),
# igraph1 = f2(DF, 1),
matrix2 = f(DF, 2),
# igraph2 = f2(DF, 2),
matrix3 = f(DF, 3),
igraph3 = f2(DF, 3),
matrix4 = f(DF, 4),
igraph4 = f2(DF, 4),
matrix5 = f(DF, 5),
igraph5 = f2(DF, 5),
matrix6 = f(DF, 6),
igraph6 = f2(DF, 6),
matrix8 = f(DF, 8),
igraph8 = f2(DF, 8),
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> matrix1 4.6677 6.1859 6.61501 6.83370 7.0784 8.2880 10
#> matrix2 2.4428 2.5075 2.84573 2.80790 3.1315 3.5438 10
#> matrix3 35.9857 37.0604 43.86820 41.89480 49.0314 61.3006 10
#> igraph3 5934.9483 6136.9761 6240.44407 6278.98705 6362.0348 6471.6932 10
#> matrix4 11.7949 12.2703 13.64224 13.99900 14.4769 16.3989 10
#> igraph4 184.7337 191.0718 210.06073 203.10085 218.6591 255.5529 10
#> matrix5 6.3713 6.8209 7.23380 7.27330 7.6825 7.9681 10
#> igraph5 26.8565 29.4616 32.98676 33.06670 35.5976 39.0243 10
#> matrix6 6.3562 6.5861 7.14332 6.86025 7.6393 8.6200 10
#> igraph6 10.9108 11.6408 14.39365 12.39585 14.8864 28.6385 10
#> matrix8 1.1340 1.2327 1.33831 1.33730 1.4175 1.5043 10
#> igraph8 2.9498 3.6359 3.95806 3.72040 4.3448 5.2509 10
The igraph approach is consistently slower, sometimes dramatically so.

Data:
set.seed(123)
(DF <- tibble(
Sites = 1:100,
NYears =rbinom(100, 40, .2)
) %>%
rowwise() %>%
mutate(Years = list(sample(1982:2021, NYears))) %>%
unnest(Years) %>%
select(-NYears))
Here is a (somewhat) optimized version of my original brute-force approach: instead of going through all possible combinations of years, it iteratively builds the list of sites sampled in n + 1 years.
yr.sites <- tapply(DF$Sites, DF$Years, identity, simplify=F)
n.yrs <- length(yr.sites)
# year indices of each combination are stored in an attribute 'yrs'
for (i in seq_len(n.yrs)) attr(yr.sites[[i]], 'yrs') <- i
add_year <- function(ycs) {
ycs <- lapply(ycs, \(sites) {
yrs.comb <- attr(sites, 'yrs')
last.yr <- tail(yrs.comb, 1)
if (last.yr < n.yrs) {
# find subsets of sites sampled also in another year (for all years
# after the last year already considered)
lapply(seq(last.yr+1, n.yrs), \(yr) {
inter <- intersect(sites, yr.sites[[yr]])
# keep only intersections containing two or more sites
if (length(inter) > 1) {
attr(inter, 'yrs') <- c(yrs.comb, yr)
inter
} else NULL
})
} else NULL
})
ycs <- unlist(ycs, recursive=F)
ycs[lengths(ycs) > 0]
}
yrs.comb.sites <- yr.sites
for (i in 2:5) {
yrs.comb.sites <- add_year(yrs.comb.sites)
}
It still takes about a second to compute, so the other answers are probably more efficient, but compared to the brute-force approach the number of combinations visited is drastically reduced.
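A rough way to quantify that reduction (a sketch, not in the original answer): compare the full number of 5-year combinations with how many year combinations still have at least two sites in common after the iterative construction.

choose(40, 5)            # 658008 combinations the brute force has to check
length(yrs.comb.sites)   # 5-year combinations that still have two or more common sites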
And the winners are:
best <- unname(yrs.comb.sites[lengths(yrs.comb.sites)==max(lengths(yrs.comb.sites))])
for (i in seq_along(best)) {
attr(best[[i]], 'yrs') <- as.numeric(names(yr.sites))[attr(best[[i]], 'yrs')]
}
best
# [[1]]
# [1] 26 53 84
# attr(,"yrs")
# [1] 1989 1991 1998 2001 2011
#
# [[2]]
# [1] 31 59 67
# attr(,"yrs")
# [1] 1989 1992 1999 2002 2005
This is a very slow brute-force approach; both @Mark and @jblood94 provided much more efficient solutions. On my machine (not a powerful one, to be honest), the computation takes about 80 seconds.
yr.site <- tapply(DF$Sites, DF$Years, identity, simplify=F)
year.comb.sites <- combn(yr.site, 5, Reduce, f=intersect, simplify=F)
max.groups <- year.comb.sites[lengths(year.comb.sites) == max(lengths(year.comb.sites))]
max.groups
# [[1]]
# [1] 26 53 84
#
# [[2]]
# [1] 31 59 67
The code is short, but there is a lot of computation behind it. yr.site is a list storing the sites sampled in each year. The combn line generates all combinations of 5 years (there are 658008 such combinations) and, for each combination, finds the sites sampled in all of those years. In the end, two groups of at most 3 sites satisfy the initial condition.
We can confirm this with
sort(table(subset(DF, Sites %in% max.groups[[1]])$Years))
# 1984 1988 1990 1993 1994 1996 1999 2000 2003 2005 2010 2017 2018 2021 1989 1991 1998 2001 2011
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
This shows that, for example, all three sites in the first group were indeed sampled in 1989, 1991, 1998, 2001, and 2011.
As I already said, this solution is not efficient and does not scale well if you want to require more than 5 shared years. With 6 years, for example, the number of combinations (choose(40, 6)) grows to 3838380, and higher numbers of years are even worse news.
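For reference (a small illustration, not in the original answer), this is how quickly the number of year combinations grows with the required number of shared years:

sapply(5:8, \(k) choose(40, k))
#> [1]   658008  3838380 18643560 76904685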
Follow-up to the igraph solution: breaking down the code

I did not want my igraph answer to become too long to read, so I created a new answer to elaborate on how that solution works.
For simplicity, we can start with a smaller dummy example, where we assume that the goal is to find the maximum number of Sites sharing at least 3 Years.
set.seed(0)
DF <- data.frame(
Sites = rep(1:10, each = 10),
Years = sample(2000:2020, 100, replace = TRUE)
) %>%
unique()
A visualization of DF as a graph can be produced with
DF %>%
graph_from_data_frame() %>%
set_vertex_attr(name = "color", value = names(V(.)) %in% DF$Sites) %>%
plot()
1). As we can see, the graph can be treated as a bipartite graph (Sites and Years being two different types of vertices), and the shared Years can in fact be projected onto edges. Since we need to keep track of the number of shared Years, we can use the edge attribute "weight" to encode that count. In this case, an adjacency matrix of Sites is needed before the projection, obtained via table + tcrossprod, e.g.,
adjmat <- DF %>%
table() %>%
tcrossprod()
which gives
> adjmat
Sites
Sites 1 2 3 4 5 6 7 8 9 10
1 8 2 3 0 4 4 4 2 5 3
2 2 7 5 2 3 1 2 3 2 4
3 3 5 9 4 4 3 2 3 2 4
4 0 2 4 6 3 2 1 2 1 3
5 4 3 4 3 9 5 4 4 4 4
6 4 1 3 2 5 7 2 3 2 3
7 4 2 2 1 4 2 9 4 3 4
8 2 3 3 2 4 3 4 9 2 5
9 5 2 2 1 4 2 3 2 7 4
10 3 4 4 3 4 3 4 5 4 9
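As a side note (not part of the original answer), the same weighted Sites graph can also be obtained directly with igraph's bipartite projection, assuming the same small DF as above:

g_bip <- graph_from_data_frame(DF, directed = FALSE)
# mark the two vertex types: FALSE = Sites, TRUE = Years
V(g_bip)$type <- names(V(g_bip)) %in% as.character(DF$Years)
proj <- bipartite_projection(g_bip, which = "false")   # projection onto the Sites
# E(proj)$weight now holds the number of shared Years per pair of Sites,
# i.e. the off-diagonal entries of `adjmat` above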
2). As stated in the goal, we want to find the groups of Sites that have >= 3 Years in common, which means the edge "weight" should be at least 3. On top of the adjacency matrix adjmat obtained in step 1), we can apply the filter (>= 3) to simplify the matrix, which is equivalent to pruning the network, i.e.,
g <- adjmat %>%
`>=`(3) %>%
graph_from_adjacency_matrix(mode = "undirected", diag = FALSE)
3). We know that Sites sharing the same Years should give rise to a complete subgraph, i.e., a clique. We can therefore enumerate all the cliques in graph g, which is done by the function cliques(), i.e.,
clq <- g %>%
cliques(min = 2)
where min = 2 specifies the minimum size of each clique. If you have no prior information about the size, cliques() can also be called without any additional arguments. Now, clq is a list that looks like
> clq
[[1]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 10
[[2]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 5
[[3]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 10
[[4]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 10
[[5]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 7
[[6]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 10
[[7]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 10
[[8]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 8
[[9]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 8
[[10]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 7 8 10
[[11]]
+ 3/10 vertices, named, from 3b71441:
[1] 7 8 10
[[12]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 8
[[13]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 8
[[14]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 8 10
[[15]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 8 10
[[16]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 8
[[17]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 8 10
[[18]]
+ 2/10 vertices, named, from 3b71441:
[1] 8 10
[[19]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 7
[[20]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 7
[[21]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 10
[[22]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 10
[[23]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 3
[[24]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 5
[[25]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 10
[[26]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 10
[[27]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 5
[[28]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 10
[[29]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 10
[[30]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 4
[[31]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 5
[[32]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 4 5 10
[[33]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 10
[[34]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 5
[[35]]
+ 3/10 vertices, named, from 3b71441:
[1] 4 5 10
[[36]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 10
[[37]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 9
[[38]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 9
[[39]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 9
[[40]]
+ 5/10 vertices, named, from 3b71441:
[1] 1 5 7 9 10
[[41]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 7 9 10
[[42]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 9
[[43]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 9 10
[[44]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 9 10
[[45]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 9
[[46]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 9
[[47]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 7 9 10
[[48]]
+ 3/10 vertices, named, from 3b71441:
[1] 7 9 10
[[49]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 9
[[50]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 9 10
[[51]]
+ 2/10 vertices, named, from 3b71441:
[1] 9 10
[[52]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 6
[[53]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 6
[[54]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 6
[[55]]
+ 5/10 vertices, named, from 3b71441:
[1] 1 3 5 6 10
[[56]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 6 10
[[57]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 6
[[58]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 6 10
[[59]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 6 10
[[60]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 8
[[61]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 8
[[62]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 8
[[63]]
+ 5/10 vertices, named, from 3b71441:
[1] 3 5 6 8 10
[[64]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 6 8 10
[[65]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 8
[[66]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 6 8 10
[[67]]
+ 3/10 vertices, named, from 3b71441:
[1] 6 8 10
[[68]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 6
[[69]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 6
[[70]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 10
[[71]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 10
[[72]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 6
[[73]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 10
[[74]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 10
[[75]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 8
[[76]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 8
[[77]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 8
[[78]]
+ 5/10 vertices, named, from 3b71441:
[1] 2 3 5 8 10
[[79]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 8 10
[[80]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 8
[[81]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 5 8 10
[[82]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 8 10
[[83]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 3
[[84]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 5
[[85]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 10
[[86]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 10
[[87]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 5
[[88]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 10
[[89]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 10
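As a quick tally (not in the original answer), the enumeration above yields 89 candidate cliques, which can be checked with:

length(clq)          # 89 candidate cliques in total
table(lengths(clq))  # broken down by clique size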
4). It should be noted that not all cliques fulfill the requirement of >= 3 shared Years, because the "weight" indicates the total count of shared Years, without distinguishing which Years are shared. In other words, 2000, 2001, 2002 is valid but 2000, 2000, 2002 is not, even though the latter also has a count of 3. We therefore need to check the distribution of the shared Years within each clique.
For example, if we take the third clique, clq[[3]], and check the distribution of the shared Years
> q <- clq[[3]]
> table(subset(DF, Sites %in% names(q)))
Years
Sites 2000 2001 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
3 1 1 1 0 0 0 1 1 0 1 0 1 1 0
5 0 1 0 1 1 1 0 1 0 1 1 1 0 0
10 1 0 0 0 1 1 0 1 1 1 0 0 1 1
Years
Sites 2018 2019 2020
3 0 1 0
5 1 0 0
10 0 0 1
we see that the only columns consisting entirely of 1s are 2009 and 2011, which means those Sites have only 2 shared Years, so the clique is invalid. To select the valid cliques, we can use Filter with the filtering criterion, e.g.,
validclq <- clq %>%
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 3
}, .
)
and we will see
> validclq
[[1]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 10
[[2]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 5
[[3]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 10
[[4]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 7
[[5]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 10
[[6]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 8
[[7]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 8
[[8]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 8
[[9]]
+ 2/10 vertices, named, from 9bd9430:
[1] 8 10
[[10]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 7
[[11]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 3
[[12]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 5
[[13]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 10
[[14]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 4
[[15]]
+ 3/10 vertices, named, from 9bd9430:
[1] 3 4 10
[[16]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 5
[[17]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 10
[[18]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 9
[[19]]
+ 3/10 vertices, named, from 9bd9430:
[1] 1 5 9
[[20]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 9
[[21]]
+ 3/10 vertices, named, from 9bd9430:
[1] 7 9 10
[[22]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 9
[[23]]
+ 2/10 vertices, named, from 9bd9430:
[1] 9 10
[[24]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 6
[[25]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 8
[[26]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 6
[[27]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 6
[[28]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 10
[[29]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 8
[[30]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 3
[[31]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 5
[[32]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 10
Run Code Online (Sandbox Code Playgroud)
where 32 out of the 89 cliques fulfill the requirement.
5). As the final step, we select the cliques of maximum size from the valid cliques validclq (since we are looking for the maximum number of Sites). Using lengths to obtain the size of each clique, we can filter out the largest ones, e.g.,
res <- validclq %>%
`[`(lengths(.) == max(lengths(.)))
and we finally obtain
[[1]]
+ 3/10 vertices, named, from 75803c6:
[1] 3 4 10
[[2]]
+ 3/10 vertices, named, from 75803c6:
[1] 1 5 9
[[3]]
+ 3/10 vertices, named, from 75803c6:
[1] 7 9 10
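To also recover the Years shared within each of these cliques, the same post-processing as in the main answer applies, e.g.,

lapply(
  res,
  \(q) {
    list(
      sites = as.integer(names(q)),
      sharedYears = as.integer(names(which(
        table(with(DF, Years[Sites %in% names(q)])) == length(q)
      )))
    )
  }
)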