Nan*_*ami 2 optimization loops if-statement r vectorization
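I have been given two very large data sets, and I have been trying to build a function that finds coordinates in one set that satisfy an if-condition with respect to the other set. My problem is that the function I wrote is very slow, and although I have read answers to similar questions, I have not managed to speed it up.
So if I am given: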
>head(CTSS)
V1 V2 V3
1 chr1 564563 564598
2 chr1 564620 564649
3 chr1 565369 565404
4 chr1 565463 565541
5 chr1 565653 565697
6 chr1 565861 565922
and
> head(href)
chr region start end strand nu gene_id transcript_id
1 chr1 start_codon 67000042 67000044 + . NM_032291 NM_032291
2 chr1 CDS 67000042 67000051 + 0 NM_032291 NM_032291
3 chr1 exon 66999825 67000051 + . NM_032291 NM_032291
4 chr1 CDS 67091530 67091593 + 2 NM_032291 NM_032291
5 chr1 exon 67091530 67091593 + . NM_032291 NM_032291
6 chr1 CDS 67098753 67098777 + 1 NM_032291 NM_032291
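For each value in the start column of the href data set, I want to find the first two values in the third column of the CTSS data set that are less than or equal to it, and put them in a new data frame.
The loop I wrote: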
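# sort CTSS by its third column, largest first
y <- CTSS[order(-CTSS$V3), ]
find_CTSS <- function(x, y) {
  n <- length(x$start)
  foo <- data.frame(matrix(0, n, 6))
  for (i in seq_len(n)) {
    # rows of y whose V3 is at or below this start; the first two are the closest
    a <- which(y$V3 <= x$start[i])
    # href has no `stop` column (x$stop would be NULL), so use `end`
    foo[i, ] <- c(x$start[i], x$end[i], y$V2[a[1]], y$V3[a[1]], y$V2[a[2]], y$V3[a[2]])
  }
  print(foo)
}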
You have provided very little data (but see here), so benchmarking your solution is somewhat difficult. See whether the following solution meets your needs.
# make some fake data
href <- data.frame(start = runif(10), stop = runif(10), other_col = sample(letters, 10))
CTSS <- data.frame(col1 = runif(100), col2 = runif(100))
# for each row in href (but extract only the start and stop columns)
result <- apply(X = href[, c("start", "stop")], MARGIN = 1, FUN = function(x, ctss) {
  criterion <- x["start"] # make a criterion
  # see which values are smaller than or equal to this criterion (and sort them)
  extracted <- sort(ctss[ctss$col2 <= criterion, "col2"])
  # extract the second-to-last and last values
  get.values <- extracted[c(length(extracted) - 1, length(extracted))]
  # put the values in a data frame
  out <- as.data.frame(matrix(get.values, ncol = 2))
  return(out)
}, ctss = CTSS)
# flatten the list into a data.frame
result <- do.call("rbind", result)
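Since the question is tagged vectorization, here is a minimal sketch of a loop-free alternative built on base R's `findInterval`. It is not part of the original answer. It assumes the column names from the question (`V2`, `V3`, `start`); the function name `find_ctss_vec` and the output column names are hypothetical. Like the loop in the question, it ignores the chromosome column. Rows with fewer than two matching values get `NA` rather than being dropped.

```r
# Sketch: for each href$start, the two largest CTSS V3 values <= start.
# find_ctss_vec and the output column names are hypothetical.
find_ctss_vec <- function(href, ctss) {
  ctss <- ctss[order(ctss$V3), ]             # findInterval needs ascending order
  idx <- findInterval(href$start, ctss$V3)   # number of V3 values <= each start
  first <- idx                               # position of the closest value at or below start
  first[idx < 1] <- NA                       # no value at or below start
  second <- idx - 1L                         # position of the next value down
  second[idx < 2] <- NA                      # fewer than two values at or below start
  data.frame(
    start = href$start,
    V2_1 = ctss$V2[first],  V3_1 = ctss$V3[first],
    V2_2 = ctss$V2[second], V3_2 = ctss$V3[second]
  )
}

# Toy data, made up purely for illustration
ctss <- data.frame(V1 = "chr1", V2 = c(10, 30, 50, 70), V3 = c(20, 40, 60, 80))
href <- data.frame(start = c(65, 25, 5))
res <- find_ctss_vec(href, ctss)
print(res)
```

Because the sorted vector is searched once for all starts, this avoids rescanning CTSS for every row of href. If matches must stay within the same chromosome, one option would be to split both data frames by chromosome first and apply the function per group.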