是否可以使用R data.table函数foverlaps来查找两个表中重叠范围的交集?

Pet*_*ete 11 r bioinformatics data.table

我想使用foverlaps查找两个床文件的交叉范围,并将包含重叠范围的任何行折叠成一行.在下面的例子中,我有两个基因组范围表.这些表被称为"床"文件,其具有从零开始的坐标和染色体中特征的基于一个的结束位置.例如,START = 9,STOP = 20被解释为跨越10到20的基数,包括10和20.这些床文件可以包含数百万行.无论提供两个要交叉的文件的顺序如何,解决方案都需要提供相同的结果.

第一张表

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200
Run Code Online (Sandbox Code Playgroud)

第二张表

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50
Run Code Online (Sandbox Code Playgroud)

我认为新的foverlaps函数可以是一种非常快速的方法来查找这两个表中的交叉范围,以生成一个表格,如下所示:

结果表:

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  
Run Code Online (Sandbox Code Playgroud)

这是可能的,还是有更好的方法在data.table中做到这一点?

我还想首先确认在一个表中,对于任何给定的CHROMOSOME,STOP坐标不与下一行的起始坐标重叠.例如,染色体Y:1-15和染色体Y:10-50需要折叠到染色体Y:1-50(参见第二表第7和第8行).情况应该不是这样,但功能应该检查一下.下面将展示潜在重叠如何崩溃的真实例子:

CHROM START STOP 1:1 721281 721619 2:1 721430 721906 3:1 721751 722042

期望的输出:

CHROM START STOP 1:1 721281 722042

创建示例表的函数如下:

   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042
Run Code Online (Sandbox Code Playgroud)

Pet*_*ete 10

@Seth提供了使用data.table foverlaps函数解决交集重叠问题的最快方法.然而,该解决方案没有考虑输入床文件可能具有需要被缩减为单个区域的重叠范围的事实.@Martin Morgan用他的解决方案使用GenomicRanges软件包解决了这个问题,同时减少了交叉和范围.但是,Martin的解决方案没有使用foverlaps功能.@Arun指出,使用foverlaps时,表中的不同行中的重叠范围目前是不可能的.感谢提供的答案,以及对stackoverflow的一些额外研究,我提出了这种混合解决方案.

创建示例BED文件,每个文件中没有重叠区域.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)
Run Code Online (Sandbox Code Playgroud)

现在创建一个新的床文件,其中包含bedA和bedB中的交叉区域.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE
Run Code Online (Sandbox Code Playgroud)

现在,修改bedA和bedB,使它们包含重叠区域:

#Create overlapping ranges
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Run Code Online (Sandbox Code Playgroud)

使用foverlaps或GenomicRanges功能测试具有重叠范围的床文件的交叉时间.

#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE
Run Code Online (Sandbox Code Playgroud)

获胜者:foverlaps!

使用的功能

这是基于foverlaps的函数,如果存在重叠范围(使用rowShift函数检查),则只调用GenomicRanges函数(reduceBed.GenomicRanges).

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

reduceBed.GenomicRanges <- function(bed) {
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}
Run Code Online (Sandbox Code Playgroud)

此函数严格使用GenomicRanges包,产生相同的结果,但比foverlaps功能慢约10倍.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}
Run Code Online (Sandbox Code Playgroud)

使用IRanges的额外比较

我找到了使用IRanges折叠重叠区域的解决方案,但它比GenomicRanges慢10倍.

reduceBed.IRanges <- function(bed) {
  bed.tmp <- bed
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[,-c(1:2),with=F])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE
Run Code Online (Sandbox Code Playgroud)