Eva*_*nig 12 algorithm performance r
I would like to ask for efficiency advice on a specific coding problem in R. I have a vector of strings in the following style:
[1] "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES"
[2] "DISTANCE=2179"
[3] "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES"
[4] "DISTANCE=27;CANONICAL=YES;common"
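In each element of the vector, the individual entries are separated by a ; and most of them have the format KEY=VALUE. However, some entries consist of a KEY only (see "common" in [4]). In this example there are 15 different keys, and not every key appears in every element of the vector. The 15 different keys are: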
names <- c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common')
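To make the entry format concrete, here is a minimal sketch of splitting a single element into its keys and values. The last entry of `element` is a made-up example of a value that itself contains "=", and the variable names (`element`, `entries`, `keys_found`, `values`) are only illustrative. Key-only entries get the value "YES", which is an assumption taken from the desired output shown further down.

```r
# One element in the style shown above; the HGVSp entry is hypothetical
# and only there to show a value that contains "=" itself.
element <- "DISTANCE=27;CANONICAL=YES;common;HGVSp=p.Ter10="

# Split the element into its entries on ";"
entries <- strsplit(element, ";", fixed = TRUE)[[1]]

# An entry is KEY=VALUE if it contains "=", otherwise it is KEY only
has_value <- grepl("=", entries, fixed = TRUE)

# The key is everything before the first "="
keys_found <- sub("=.*$", "", entries)

# The value is everything after the first "="; key-only entries get "YES"
values <- ifelse(has_value, sub("^[^=]*=", "", entries), "YES")

setNames(values, keys_found)
```

Splitting on the first "=" only, rather than on every "=", keeps values that contain "=" intact without having to paste the pieces back together.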
From this vector I want to create a data frame that looks like this:
ENSP HGVS DOMAINS EXON INTRON HGVSp HGVSc CANONICAL
1 - - - - 1/1 - ENST00000495576.1:n.820-1G>A YES
2 - - - - - - - -
3 - - - 4/4 - - ENST00000466430.1:n.911C>T YES
4 - - - - - - - YES
GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
1 - - - - - - -
2 - 2179 - - - - -
3 - - - - - - -
4 - 27 - - - - YES
I wrote this function to solve the problem:
unlist.info <- function(names, column){
  info.mat <- matrix(rep('-', length(column)*length(names)), nrow=length(column), ncol=length(names), dimnames=list(c(), names))
  info.mat <- as.data.frame(info.mat, stringsAsFactors=FALSE)
  for (i in seq_along(column)){
    info <- unlist(strsplit(column[i], "\\;"))
    for (e in info){
      e <- unlist(strsplit(e, "\\="))
      j <- which(names == e[1])
      if (length(e) > 1){
        # KEY=VALUE. The value might contain a = as well
        value <- paste(e[2:length(e)], collapse='=')
        info.mat[i,j] <- value
      }else{
        # only KEY
        info.mat[i,j] <- 'YES'
      }
    }
  }
  return(info.mat)
}
Then I call it with:
mat <- unlist.info(names, vector)
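Although this works, it is really slow, and I am working with vectors of more than 100,000 entries. I realise that loops are inelegant and inefficient in R, and I am familiar with the concept of applying functions to data frames. However, because each entry of the vector contains a different subset of KEY=VALUE or KEY entries, I could not come up with a more efficient function.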
And*_*rie 11
Here you go:
Recreate the data:
x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1//1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4//4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)
Create a named vector with the desired names. It is used for fast lookup later:
names <- setNames(1:15, c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc','CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))
Create a helper function that assigns each variable to the correct position in the matrix. Then use lapply and strsplit:
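# Fill one row: x is a list of c(KEY, VALUE) or KEY-only character vectors
assign <- function(x, names){
  # Key-only entries get the value "YES", so every entry becomes a KEY/VALUE pair
  xx <- sapply(x, function(i) if(length(i)==2L) i else c(i, "YES"))
  z <- rep(NA, length(names))
  # Look up each key's column position by name and store its value there
  z[names[xx[1, ]]] <- xx[2, ]
  z
}
# Split each element on ";" and then each entry on "="
sx <- lapply(strsplit(x, ";"), strsplit, "=")
# One row per element of x
ret <- t(sapply(sx, assign, names))
colnames(ret) <- names(names)
ret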
Result:
ENSP HGVS DOMAINS EXON INTRON HGVSp HGVSc CANONICAL GMAF DISTANCE HGNC
[1,] NA NA NA NA "1//1" NA "ENST00000495576.1:n.820-1G>A" "YES" NA NA NA
[2,] NA NA NA NA NA NA NA NA NA "2179" NA
[3,] NA NA NA "4//4" NA NA "ENST00000466430.1:n.911C>T" "YES" NA NA NA
[4,] NA NA NA NA NA NA NA "YES" NA "27" NA
CCDS SIFT PolyPhen common
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
[4,] NA NA NA "YES"
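If you want to avoid the per-element function calls altogether, one possible variation is to flatten all entries into a single vector and fill the result with one matrix-indexing assignment. The following is only a sketch under a few assumptions: `parse_info`, `keys` and `fill` are hypothetical names that do not appear in the question or the code above, missing values are filled with "-" as in the desired output from the question, and each entry is split on its first "=" only, so a value that itself contains "=" stays in one piece (splitting on every "=" would break such a value into more than two parts).

```r
# Hypothetical helper, not part of the original answer.
# x:    character vector of ";"-separated KEY=VALUE or KEY entries
# keys: character vector of the column names to extract
# fill: placeholder for keys that are absent from an element
parse_info <- function(x, keys, fill = "-") {
  parts <- strsplit(x, ";", fixed = TRUE)

  # Flatten: one row index and one entry string per KEY[=VALUE] item
  row   <- rep(seq_along(parts), lengths(parts))
  entry <- unlist(parts, use.names = FALSE)

  # Split each entry on its first "=" only; key-only entries get "YES"
  has_value <- grepl("=", entry, fixed = TRUE)
  key   <- sub("=.*$", "", entry)
  value <- ifelse(has_value, sub("^[^=]*=", "", entry), "YES")

  # Column position of each key; entries with unknown keys are dropped
  col  <- match(key, keys)
  keep <- !is.na(col)

  # Fill a character matrix in one step via (row, column) index pairs
  out <- matrix(fill, nrow = length(x), ncol = length(keys),
                dimnames = list(NULL, keys))
  out[cbind(row[keep], col[keep])] <- value[keep]

  as.data.frame(out, stringsAsFactors = FALSE)
}

keys <- c('ENSP','HGVS','DOMAINS','EXON','INTRON', 'HGVSp', 'HGVSc',
          'CANONICAL','GMAF','DISTANCE', 'HGNC', 'CCDS', 'SIFT',
          'PolyPhen', 'common')

x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1/1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4/4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)

res <- parse_info(x, keys)
res
```

All string work here is done by vectorised functions over the flattened entries, and the data frame is created once at the end instead of being modified cell by cell. I have not timed this on 100,000 elements, so treat any speed-up as something to measure on your own data.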