Select the maximum number of values in one column that share values in another column in R

Question

mfe*_*ira 19 algorithm performance r igraph dataframe

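I have a large dataset of sites that were sampled irregularly over 40 years. I want to select the maximum number of sites that share, let's say, at least 5 years of data.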

Any pointers would be much appreciated.

Here's an example dataset:

library(tidyverse)

set.seed(123)

DF <- tibble(
  Sites = 1:100,
  NYears = rbinom(100, 40, .2)
) %>%
  rowwise() %>%
  mutate(Years = list(sample(1982:2021, NYears))) %>%
  unnest(Years) %>%
  select(-NYears)
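
To make the goal concrete, here is a minimal base-R sketch of what "sharing years" means for a given set of sites. The helper `shared_years()` and the `toy` data frame are hypothetical names introduced for illustration only; they are not part of the original post. The sketch assumes a long data frame with one row per site-year, like `DF` above.

```r
# Hypothetical helper (not from the original post): returns the years that
# were sampled at every site in `sites`, given a long data frame with
# columns `Sites` and `Years`.
shared_years <- function(df, sites) {
  years_by_site <- split(df$Years, df$Sites)[as.character(sites)]
  Reduce(intersect, years_by_site)
}

# Toy data: three sites with hand-picked sampling years.
toy <- data.frame(
  Sites = c(1, 1, 1, 2, 2, 2, 3, 3),
  Years = c(2000, 2001, 2002, 2000, 2001, 2005, 2001, 2002)
)

shared_years(toy, c(1, 2))     # 2000 2001
shared_years(toy, c(1, 2, 3))  # 2001
```

Under this reading, the task is to find the largest set of sites for which `length(shared_years(DF, sites))` is at least 5.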

Answer 1

Mar*_*ark 10

Here is a naïve solution. It may not scale to larger datasets, but it could be a good starting point for finding a better one:

library(tidyverse)

set.seed(123)

# I create the dataset myself, because I don't want it to be unnested.
# Sorting the years matters later: when I look for combinations, I can be
# sure they will always be in the same order.
DF <- tibble(
  Sites = 1:100,
  NYears = rbinom(100, 40, .2)
) %>%
  rowwise() %>%
  mutate(Years = list(sort(sample(1982:2021, NYears))))

# Basically, we're doing a cross join, keeping pairs of sites that overlap in
# at least 5 years, then generating all possible 5-year combinations of those
# overlaps.
overlaps <- cross_join(DF, DF) %>%
  filter(Sites.x < Sites.y) %>%
  mutate(Overlap = list(intersect(Years.x, Years.y))) %>%
  filter(length(Overlap) >= 5) %>%
  mutate(combinations = list(combn(Overlap, 5, simplify = FALSE))) %>%
  select(combinations, Sites.x, Sites.y) %>%
  unnest(combinations)

# The 5-year combinations that occur in the largest number of site pairs
most_common_fives <- overlaps %>%
  count(combinations) %>%
  slice_max(n) %>%
  pull(combinations)

# For each of those combinations, collect the sites that share it
overlaps %>%
  filter(combinations %in% most_common_fives) %>%
  group_by(combinations) %>%
  summarise(values = (list(unique(c(Sites.x, c(Sites.y)))))) %>%
  pull(combinations, values)

#> $`c(26, 53, 84)`
#> [1] 1989 1991 1998 2001 2011
#>
#> $`c(31, 59, 67)`
#> [1] 1989 1992 1999 2002 2005
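
The counting idea in the code above can be checked by hand on a tiny example. The following is a base-R sketch, not the answer's own code: the `years` list, the threshold `k`, and the other variable names are made up for illustration, and `k` is set to 2 instead of 5 to keep the data small. A combination of years shared by exactly m sites shows up in m(m-1)/2 site pairs, so the most frequent combination belongs to the largest group of sites.

```r
# Toy data (hypothetical): sampling years for three sites.
years <- list(
  A = c(1990, 1991, 1992, 1993),
  B = c(1990, 1991, 1992, 1995),
  C = c(1990, 1991, 1993, 1995)
)
k <- 2  # number of shared years required (5 in the question)

# For every pair of sites, list every k-year subset of their common years.
site_pairs <- combn(names(years), 2, simplify = FALSE)
rows <- lapply(site_pairs, function(p) {
  common <- sort(intersect(years[[p[1]]], years[[p[2]]]))
  if (length(common) < k) return(NULL)
  data.frame(
    combo = combn(common, k, FUN = paste, collapse = "-"),
    x = p[1],
    y = p[2]
  )
})
overlaps <- do.call(rbind, rows)

# The combination seen in the most pairs, and the sites that share it.
counts <- table(overlaps$combo)
best <- names(counts)[counts == max(counts)]
sites <- with(overlaps[overlaps$combo %in% best, ], unique(c(x, y)))

best   # "1990-1991"
sites  # "A" "B" "C"
```

The number of k-year subsets grows quickly with the size of each overlap, which is one reason this approach is unlikely to scale to large datasets.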

Answer 2

Tho*_*ing 10

Disclaimer

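The approach below is not as efficient as @jblood94's solution (so if you are after speed, please don't use my solution on large datasets). It is only meant to shift the mindset by thinking in graph-theory terms and to explore the possibility of solving the problem with igraph.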

Brief Idea (Graph Theory Perspective)

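Generally speaking, I think this problem can be approached in a graph-theory way and solved with igraph. If you are after efficiency, you may need to explore the underlying properties hidden behind the graph. For example: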

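The number of shared `Years` can be interpreted as the weight of the edge connecting two `Sites` vertices.
In addition, the graph can be simplified further, since edges with weight <= 4 can be skipped when searching for cliques. Pruning the network first and searching afterwards should be more efficient than iterating over all possible combinations.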

If you are interested in the details, please refer to the follow-up answer and the code breakdown.
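
The edge-weight interpretation can be illustrated in base R before any graph is built. This is a minimal sketch on made-up data: the `toy` data frame, the threshold of 2 (standing in for 5), and the variable names are assumptions for illustration. Cross-tabulating sites against years gives a 0/1 incidence table, and `tcrossprod()` of that table counts the years each pair of sites has in common.

```r
# Toy data (hypothetical): one row per site-year, as in the question's `DF`.
toy <- data.frame(
  Sites = c("A", "A", "A", "B", "B", "B", "C", "C"),
  Years = c(2000, 2001, 2002, 2000, 2001, 2005, 2001, 2002)
)

# Sites x Years incidence table: 1 if the site was sampled in that year.
incidence <- table(toy)

# Sites x Sites matrix: entry [i, j] is the number of years shared by i and j
# (the diagonal holds each site's own number of sampled years).
weights <- tcrossprod(incidence)

weights["A", "B"]  # 2 (2000 and 2001)
weights["A", "C"]  # 2 (2001 and 2002)
weights["B", "C"]  # 1 (2001 only)

# Pruning: keep only the pairs that share at least 2 years.
keep <- weights >= 2 & upper.tri(weights)
which(keep, arr.ind = TRUE)
```

Here the pair B-C would be dropped before any clique search, which is the simplification described above.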

An `igraph` Approach

Below is one possible way to solve the problem with igraph (see the code comments for details): you can try `graph_from_adjacency_matrix` and find the cliques of `Sites` using `cliques()`, e.g.,

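library(tidyverse)
library(igraph)

# `DF` is the unnested data frame (columns `Sites` and `Years`) from the question
res <- DF %>%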
    table() %>%
    tcrossprod() %>%
    # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
    graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
    ) %>%
    # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
    subgraph.edges(E(.)[E(.)$weight > 4]) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))
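
The "double check" step in the pipeline is needed because a clique only guarantees that every pair of sites shares enough years, not that all sites share the same years. A small base-R sketch on made-up data shows the difference; the `years` list and the variable names are hypothetical, and the threshold is 2 instead of 5 to keep the example short.

```r
# Toy data (hypothetical): every pair of sites shares exactly 2 years,
# but no year is common to all three sites.
years <- list(
  A = c(2000, 2001, 2002, 2003),
  B = c(2000, 2001, 2004, 2005),
  C = c(2002, 2003, 2004, 2005)
)

# Number of shared years for each pair: A-B, A-C, B-C.
pairwise <- combn(names(years), 2, FUN = function(p) {
  length(intersect(years[[p[1]]], years[[p[2]]]))
})
pairwise  # 2 2 2

# Years shared by all three sites at once.
joint <- Reduce(intersect, years)
length(joint)  # 0

# The same count, written like the check inside `Filter()` above:
# the number of years that occur once per site in the clique.
n_common <- sum(table(unlist(years)) == length(years))
n_common  # 0
```

With a threshold of 2, these three sites would form a clique in the pruned graph, yet they would fail the filter because they have no year in common.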
