Find which column ranges overlap after grouping in R

LDT*_*LDT 6 r intervals dplyr data.table genomicranges

I have a huge data frame that looks like this.

I want to group_by(chr) and then, for each chr, find

  • whether any range1 (start1, end1) within the chr group overlaps with any range2 (start2, end2).
    library(dplyr)

    df1 <- tibble(chr = c(1, 1, 2, 2),
                  start1 = c(100, 200, 100, 200),
                  end1 = c(150, 400, 150, 400),
                  species = c("Penguin"),
                  start2 = c(200, 200, 500, 1000),
                  end2 = c(250, 240, 1000, 2000)
                  )

    df1
    #> # A tibble: 4 × 6
    #>     chr start1  end1 species start2  end2
    #>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
    #> 1     1    100   150 Penguin    200   250
    #> 2     1    200   400 Penguin    200   240
    #> 3     2    100   150 Penguin    500  1000
    #> 4     2    200   400 Penguin   1000  2000

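Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether range2 overlaps with any range1. The new data does not change the question; it only serves to check the code.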
    # A tibble: 4 × 7
        chr start1  end1 species start2  end2 OVERLAP
          1    100   150 Penguin    200   250    TRUE
          1    200   400 Penguin    200   240    TRUE
          2    100   150 Penguin    500  1000    FALSE
          2    200   400 Penguin   1000  2000    FALSE

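I have struggled a lot with the ivs package and iv_overlaps, but I have not managed to get what I want.

MAJOR EDIT:

When I apply any of the suggested code to my real data, I do not get the results I want, and I am confused. Why? The new dataset does not change the question; it only serves to check the code.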
    library(dplyr)
    library(ivs)

    data <- tibble::tribble(
      ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
      "Chr2",   2739,   2840, "+", "A",    740,   1739,
      "Chr2",  12577,  12678, "+", "B",  10578,  11577,
      "Chr2",  22431,  22532, "+", "C",  20432,  21431,
      "Chr2",  32202,  32303, "+", "D",  30203,  31202,
      "Chr2",  42024,  42125, "+", "E",  40025,  41024,
      "Chr2",  51830,  51931, "+", "F",  49831,  50830,
      "Chr2",  82061,  84742, "+", "G",  80062,  81061,
      "Chr2",  84811,  86692, "+", "H",  82812,  83811,
      "Chr2",  86782,  88106, "-", "I",  88107,  89106,
      "Chr2", 139454, 139555, "+", "J", 137455, 138454
    )

    # One TRUE/FALSE per chr group: does any range1 overlap any range2?
    data %>%
      group_by(chr) %>%
      mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

which gives as output

       chr   start1   end1 strand gene  start2   end2 overlap
       <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>
     1 Chr2    2739   2840 +      A        740   1739 TRUE
     2 Chr2   12577  12678 +      B      10578  11577 TRUE
     3 Chr2   22431  22532 +      C      20432  21431 TRUE
     4 Chr2   32202  32303 +      D      30203  31202 TRUE
     5 Chr2   42024  42125 +      E      40025  41024 TRUE
     6 Chr2   51830  51931 +      F      49831  50830 TRUE
     7 Chr2   82061  84742 +      G      80062  81061 TRUE
     8 Chr2   84811  86692 +      H      82812  83811 TRUE
     9 Chr2   86782  88106 -      I      88107  89106 TRUE
    10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. The ranges may match indirectly, but there is no direct overlap.
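One way to see where the TRUE values may come from is to compare ranges across rows rather than only within a row. The sketch below uses base R only and copies rows 7 and 8 (genes G and H) from the data above by hand. The strict inequalities are an assumption, chosen to mimic half-open intervals; for these values closed intervals give the same answer.

```r
# Rows 7 and 8 of the data above (genes G and H), copied by hand
start1 <- c(G = 82061, H = 84811)
end1   <- c(G = 84742, H = 86692)
start2 <- c(G = 80062, H = 82812)
end2   <- c(G = 81061, H = 83811)

# Same-row comparison: range1 of a row against range2 of that same row
same_row <- start1 < end2 & end1 > start2
same_row
# FALSE for both G and H

# All-pairs comparison: rows are range1, columns are range2
cross <- outer(start1, end2, "<") & outer(end1, start2, ">")
cross
# Only the cell [G, H] is TRUE: range1 of G overlaps range2 of H

# any() over the whole group collapses this to a single TRUE,
# which mutate() then recycles to every row of the group
any(cross)
```

Under this reading, the grouped `any(iv_overlaps(...))` call answers "does any range1 in the group overlap any range2 in the group", so a single cross-row hit marks the whole chromosome as TRUE.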

\n

Dar*_*sai 5

Scenario 1: Element-wise overlap detection

    library(dplyr)

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
      ungroup()

    # # A tibble: 4 × 7
    #     chr start1  end1 species start2  end2 OVERLAP
    #   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    # 1     1    100   150 Penguin    200   250 TRUE
    # 2     1    200   400 Penguin    200   240 TRUE
    # 3     2    100   150 Penguin    500  1000 FALSE
    # 4     2    200   400 Penguin   1000  2000 FALSE

Scenario 2: Element-wise overlap detection with sorting

If the intervals are directed, i.e. end can be less than start, then the endpoints need to be sorted before determining overlap.

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                           pmax(start1, end1) >= pmin(start2, end2)))
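To illustrate why the sorting step matters, here is a minimal base R sketch. The four intervals are hypothetical values made up for this example (they are not from the question); the first range1 runs "backwards", with start greater than end.

```r
# Hypothetical directed intervals: the first range1 has start > end
start1 <- c(400, 100)
end1   <- c(200, 150)
start2 <- c(250, 500)
end2   <- c(300, 600)

# Naive test assumes start <= end and misses the first pair
naive <- start1 <= end2 & end1 >= start2

# Sorting the endpoints with pmin()/pmax() makes the test direction-agnostic
sorted <- pmin(start1, end1) <= pmax(start2, end2) &
          pmax(start1, end1) >= pmin(start2, end2)

naive   # FALSE FALSE
sorted  # TRUE FALSE: 200..400 really does cover 250..300
```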

Scenario 3: Cross-wise overlap detection with sorting

In addition, if you want to check whether an interval (start1, end1) overlaps with any of the intervals (start2, end2) (which is what ivs::iv_overlaps() does), you can implement it with purrr::map2_lgl():

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(
        purrr::map2_lgl(start1, end1,
                        ~ any(min(.x, .y) <= pmax(start2, end2) &
                              max(.x, .y) >= pmin(start2, end2)))
      ))


Maë*_*aël 5

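Your question has several possible interpretations, so here are three possible cases:

  1. Within a group, detect whether each [start1, end1] overlaps with any of the [start2, end2].
  2. Within a group, detect whether any [start1, end1] overlaps with any of the [start2, end2].
  3. Within a group, detect whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).

In all three cases, you can use ivs::iv_overlaps.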
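One detail worth keeping in mind when comparing ivs results with hand-written inequalities: ivs treats intervals as half-open, [start, end), so two intervals that merely touch at an endpoint do not count as overlapping. The values below are hypothetical and chosen only to show the boundary case.

```r
library(ivs)

# Two hypothetical intervals that touch at 150
x <- iv(100, 150)
y <- iv(150, 200)

# Half-open semantics: [100, 150) and [150, 200) share no point
touching_ivs <- iv_overlaps(x, y)
touching_ivs  # FALSE

# A closed-interval test with <= and >= reports an overlap at the shared endpoint
touching_closed <- 100 <= 200 & 150 >= 150
touching_closed  # TRUE
```

If the coordinates are meant to be inclusive at both ends, one possible adjustment is to pass `end + 1` when building the iv.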

Case 1

Within each group, iv_overlaps detects whether each interval defined by [start1, end1] overlaps in any way with any of the intervals [start2, end2]. It returns a logical vector with the same length as [start1, end1].

    library(ivs)
    library(dplyr)

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))

    #> # A tibble: 4 × 7
    #> # Groups:   chr [2]
    #>     chr start1  end1 species start2  end2 overlap
    #>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    #> 1     1    100   150 Penguin    200   250 FALSE
    #> 2     1    200   400 Penguin    160   170 TRUE
    #> 3     2    100   150 Penguin    500  1000 FALSE
    #> 4     2    200   400 Penguin   1000  2000 FALSE

Case 2

If you want to know whether any (not each) of the intervals 1 overlaps with any of the intervals 2 (so there is a single value per group), you should add any():

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

    #> # A tibble: 4 × 7
    #> # Groups:   chr [2]
    #>     chr start1  end1 species start2  end2 overlap
    #>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    #> 1     1    100   150 Penguin    200   250 TRUE
    #> 2     1    200   400 Penguin    160   170 TRUE
    #> 3     2    100   150 Penguin    500  1000 FALSE
    #> 4     2    200   400 Penguin   1000  2000 FALSE

Case 3

If you want row-wise overlap detection, you should use map2 with iv_overlaps:

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = purrr::map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))

    #> # A tibble: 4 × 7
    #> # Groups:   chr [2]
    #>     chr start1  end1 species start2  end2 overlap
    #>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    #> 1     1    100   150 Penguin    200   250 FALSE
    #> 2     1    200   400 Penguin    160   170 FALSE
    #> 3     2    100   150 Penguin    500  1000 FALSE
    #> 4     2    200   400 Penguin   1000  2000 FALSE
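As a possible alternative for the row-wise case, ivs also offers a vectorised pairwise helper, so the map2 step may not be needed. This is a sketch that assumes the installed ivs version provides iv_pairwise_overlaps(); the data is the df1 from the Data section of this answer, repeated here so the snippet runs on its own.

```r
library(ivs)
library(dplyr)

# Same df1 as in the Data section of this answer
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000))

# Compare each range1 only with the range2 on the same row;
# no grouping is needed because nothing is compared across rows
res <- df1 %>%
  mutate(overlap = iv_pairwise_overlaps(iv(start1, end1), iv(start2, end2)))

res$overlap
```

For this data no row overlaps its own partner, so every value should be FALSE, in line with the Case 3 output.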

Order of comparison

Note that the order of the arguments matters: if you want to compare the second interval against the first, you should use iv_overlaps(interval2, interval1).

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    1     1    100   150 Penguin    200   250 TRUE
    2     1    200   400 Penguin    160   170 FALSE
    3     2    100   150 Penguin    500  1000 FALSE
    4     2    200   400 Penguin   1000  2000 FALSE
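The call behind that output is not shown; a sketch of what it presumably looks like is the Case 1 pipeline with the two iv() arguments swapped. The df1 definition is repeated from the Data section so the snippet is self-contained.

```r
library(ivs)
library(dplyr)

# Same df1 as in the Data section of this answer
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000))

# Swapped arguments: for each range2, is there any overlapping range1 in the group?
swapped <- df1 %>%
  group_by(chr) %>%
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

swapped$overlap
# TRUE FALSE FALSE FALSE
```

The result has one value per range2, which is why row 1 (range2 200 to 250, covered by range1 200 to 400 on row 2) becomes TRUE while row 2 turns FALSE.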

Data

    df1 <- tibble(chr = c(1, 1, 2, 2),
                  start1 = c(100, 200, 100, 200),
                  end1 = c(150, 400, 150, 400),
                  species = c("Penguin"),
                  start2 = c(200, 160, 500, 1000),
                  end2 = c(250, 170, 1000, 2000))