LDT*_*LDT 6
Tags: r, intervals, dplyr, data.table, genomicranges
I have a huge data frame that looks like the one below.

I want to group_by(chr) and then, for each chr, find whether there is an overlap.

    library(dplyr)

    df1 <- tibble(chr = c(1, 1, 2, 2),
                  start1 = c(100, 200, 100, 200),
                  end1 = c(150, 400, 150, 400),
                  species = "Penguin",
                  start2 = c(200, 200, 500, 1000),
                  end2 = c(250, 240, 1000, 2000))

    df1
    #> # A tibble: 4 × 6
    #>     chr start1  end1 species start2  end2
    #>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
    #> 1     1    100   150 Penguin    200   250
    #> 2     1    200   400 Penguin    200   240
    #> 3     2    100   150 Penguin    500  1000
    #> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether range2 overlaps any range1. The new data does not change the question; it only serves to check the code.

    # A tibble: 4 × 7
        chr start1  end1 species start2  end2 OVERLAP
          1    100   150 Penguin    200   250 TRUE
          1    200   400 Penguin    200   240 TRUE
          2    100   150 Penguin    500  1000 FALSE
          2    200   400 Penguin   1000  2000 FALSE

I have struggled a lot with the ivs package and iv_overlaps, but I have not managed to get what I want.

Major EDIT:

When I apply any of the code to my real data, I do not get the results I want, and I am confused. Why? The new dataset does not change the question; it only serves to check the code.

    library(dplyr)
    library(ivs)

    data <- tibble::tribble(
      ~chr,   ~start1, ~end1,  ~strand, ~gene, ~start2, ~end2,
      "Chr2",    2739,   2840, "+",     "A",       740,   1739,
      "Chr2",   12577,  12678, "+",     "B",     10578,  11577,
      "Chr2",   22431,  22532, "+",     "C",     20432,  21431,
      "Chr2",   32202,  32303, "+",     "D",     30203,  31202,
      "Chr2",   42024,  42125, "+",     "E",     40025,  41024,
      "Chr2",   51830,  51931, "+",     "F",     49831,  50830,
      "Chr2",   82061,  84742, "+",     "G",     80062,  81061,
      "Chr2",   84811,  86692, "+",     "H",     82812,  83811,
      "Chr2",   86782,  88106, "-",     "I",     88107,  89106,
      "Chr2",  139454, 139555, "+",     "J",    137455, 138454
    )

    data %>%
      group_by(chr) %>%
      mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

       chr   start1   end1 strand gene  start2   end2 overlap
       <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>
     1 Chr2    2739   2840 +      A        740   1739 TRUE
     2 Chr2   12577  12678 +      B      10578  11577 TRUE
     3 Chr2   22431  22532 +      C      20432  21431 TRUE
     4 Chr2   32202  32303 +      D      30203  31202 TRUE
     5 Chr2   42024  42125 +      E      40025  41024 TRUE
     6 Chr2   51830  51931 +      F      49831  50830 TRUE
     7 Chr2   82061  84742 +      G      80062  81061 TRUE
     8 Chr2   84811  86692 +      H      82812  83811 TRUE
     9 Chr2   86782  88106 -      I      88107  89106 TRUE
    10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. The intervals may match indirectly, but there is no direct overlap.

Scenario 1: element-wise overlap detection

    library(dplyr)

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
      ungroup()

    # # A tibble: 4 × 7
    #     chr start1  end1 species start2  end2 OVERLAP
    #   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    # 1     1    100   150 Penguin    200   250 TRUE
    # 2     1    200   400 Penguin    200   240 TRUE
    # 3     2    100   150 Penguin    500  1000 FALSE
    # 4     2    200   400 Penguin   1000  2000 FALSE

Scenario 2: element-wise overlap detection with sorting

If the intervals are directed, i.e. end can be smaller than start, the endpoints need to be sorted before checking for overlap.

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                           pmax(start1, end1) >= pmin(start2, end2)))

Scenario 3: cross-wise overlap detection with sorting

Furthermore, if you want to check whether an interval in (start1, end1) overlaps any interval in (start2, end2) (this is how ivs::iv_overlaps() works), you can implement it with purrr::map2.

    df1 %>%
      group_by(chr) %>%
      mutate(OVERLAP = any(
        purrr::map2_lgl(start1, end1,
                        ~ any(min(.x, .y) <= pmax(start2, end2) &
                              max(.x, .y) >= pmin(start2, end2)))
      ))

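
For the real data in the question's major edit, what seems to be wanted is a per-row check, i.e. the comparison from Scenario 1 without the any() that collapses each group to a single value. The following is a minimal sketch under that assumption and is not part of the original answer. It reuses the closed-interval comparison from Scenario 1, and the name res is only illustrative.

```r
library(dplyr)

# Real data from the question's major edit.
data <- tibble::tribble(
  ~chr,   ~start1, ~end1,  ~strand, ~gene, ~start2, ~end2,
  "Chr2",    2739,   2840, "+",     "A",       740,   1739,
  "Chr2",   12577,  12678, "+",     "B",     10578,  11577,
  "Chr2",   22431,  22532, "+",     "C",     20432,  21431,
  "Chr2",   32202,  32303, "+",     "D",     30203,  31202,
  "Chr2",   42024,  42125, "+",     "E",     40025,  41024,
  "Chr2",   51830,  51931, "+",     "F",     49831,  50830,
  "Chr2",   82061,  84742, "+",     "G",     80062,  81061,
  "Chr2",   84811,  86692, "+",     "H",     82812,  83811,
  "Chr2",   86782,  88106, "-",     "I",     88107,  89106,
  "Chr2",  139454, 139555, "+",     "J",    137455, 138454
)

# Compare each range1 only with the range2 on the same row.
# No group_by() is needed because nothing is aggregated across rows.
res <- data %>%
  mutate(overlap = start1 <= end2 & end1 >= start2)

res$overlap
```

With this data every row evaluates to FALSE, which is consistent with the remark in the question that no direct overlap exists.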
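Your question can be interpreted in several ways, so here are three possible cases:

1. Whether each [start1, end1] overlaps any of the [start2, end2].
2. Whether any of the [start1, end1] overlaps any of the [start2, end2].
3. Whether each [start1, end1] overlaps its corresponding [start2, end2] (the one on the same row).

In all three cases, you can use ivs::iv_overlaps.
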
Case 1

Within each group, iv_overlaps detects whether each interval [start1, end1] overlaps in any way with any of the intervals [start2, end2]. It returns a logical vector with the same length as [start1, end1].

    library(ivs)
    library(dplyr)

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    1     1    100   150 Penguin    200   250 FALSE
    2     1    200   400 Penguin    160   170 TRUE
    3     2    100   150 Penguin    500  1000 FALSE
    4     2    200   400 Penguin   1000  2000 FALSE

Case 2

If you want to know whether any (rather than each) of the first intervals overlaps any of the second intervals, so that each group gets a single value, wrap the call in any:

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    1     1    100   150 Penguin    200   250 TRUE
    2     1    200   400 Penguin    160   170 TRUE
    3     2    100   150 Penguin    500  1000 FALSE
    4     2    200   400 Penguin   1000  2000 FALSE

Case 3

If you want row-wise overlap detection, use map2 with iv_overlaps:

    library(purrr)  # provides map2_lgl

    df1 %>%
      group_by(chr) %>%
      mutate(overlap = map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    1     1    100   150 Penguin    200   250 FALSE
    2     1    200   400 Penguin    160   170 FALSE
    3     2    100   150 Penguin    500  1000 FALSE
    4     2    200   400 Penguin   1000  2000 FALSE

Order of comparison

In fact, if you want to compare the second intervals against the first, you should use iv_overlaps(interval2, interval1):

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>
    1     1    100   150 Penguin    200   250 TRUE
    2     1    200   400 Penguin    160   170 FALSE
    3     2    100   150 Penguin    500  1000 FALSE
    4     2    200   400 Penguin   1000  2000 FALSE

Data

    df1 <- tibble(chr = c(1, 1, 2, 2),
                  start1 = c(100, 200, 100, 200),
                  end1 = c(150, 400, 150, 400),
                  species = "Penguin",
                  start2 = c(200, 160, 500, 1000),
                  end2 = c(250, 170, 1000, 2000))
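
The code behind the "Order of comparison" output is not shown above. A minimal sketch that should reproduce it, assuming the df1 defined under Data and the ivs package, simply swaps the two arguments of iv_overlaps. The name res is only illustrative.

```r
library(dplyr)
library(ivs)

# Same data as in the "Data" section of this answer.
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = "Penguin",
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000))

# The first argument holds the intervals being tested, the second the
# intervals they are tested against; here each [start2, end2] is checked
# against all [start1, end1] of the same chr.
res <- df1 %>%
  group_by(chr) %>%
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

res$overlap
```

The result has one value per row of the second intervals, so it can differ from Case 1 even though the same pairs of intervals are involved.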