sca*_*der 4 string algorithm optimization r combinatorics
我有以下参考序列:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
Run Code Online (Sandbox Code Playgroud)
和这个种子模式字符串:
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
Run Code Online (Sandbox Code Playgroud)
该模式中有 10 个通配符 (?)。
鉴于此功能:
aa_count_normalized <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
) / nchar(x)
AAC
}
aa_count <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
)
AAC
}
Run Code Online (Sandbox Code Playgroud)
我可以得到 :
# we need to normalize refseq_aa content with respect
# to the length of seed_pattern, to accommodate the length
# difference between the two.
> refseq_aa_content <- aa_count_normalized(ref_seq) * nchar(seed_pattern)
> refseq_aa_content
A R N D C E
0.0000000 4.7727273 1.3636364 0.0000000 0.6818182 0.0000000
Q G H I L K
2.7272727 3.4090909 2.0454545 0.6818182 2.0454545 1.3636364
M F P S T W
1.3636364 1.3636364 0.6818182 4.0909091 0.6818182 0.6818182
Y V
1.3636364 0.6818182
Run Code Online (Sandbox Code Playgroud)
我想要做的是将 - 的通配符替换为seed pattern- 同时保持非通配符原样 - 使用来自以下内容的残基组合:
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
Run Code Online (Sandbox Code Playgroud)
使得种子序列的最终氨基酸计数和标准化参考序列计数的均方误差(MSE)最小化。
使用此 MSE 函数:
mse <- function (ref, new_seq) {
return(mean((ref - new_seq)^2))
}
Run Code Online (Sandbox Code Playgroud)
以及最终的种子序列:
seed_final.1 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY")
seed_final.2 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQQGGGSSY")
seed_final.3 <- aa_count("FKDHKHIDVKDRHRTRHLAKSSSGGGRRQQ") # onyambu's
Run Code Online (Sandbox Code Playgroud)
我明白了
> mse(refseq_aa_content, seed_final.1 )
[1] 1.501446
> mse(refseq_aa_content, seed_final.2 )
[1] 1.63781
> mse(refseq_aa_content, seed_final.3 )
[1] 1.560537
Run Code Online (Sandbox Code Playgroud)
是seed_final.1最优精确解,因为它的 MSE 最低。即 10 ?s 将替换为:
G Q R S Y
3 2 1 3 1 (total 10)
Run Code Online (Sandbox Code Playgroud)
我怎样才能创建一个高效的 R 代码作为FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY
答案返回。
您可以将问题建模为要最小化的整数二次问题:
sum(r^2) - 2 sum(z * r)
有约束条件:
sum(r) = k
r[i] nonegative integer
在哪里:
r[i]AADict你需要添加多少个字母seed_patternz[i] = n(y)/n(x) * x[i] - y[i]x[i]AADictin的第 i 个字母的计数ref_seqy[i]AADictin的第 i 个字母的计数seed_patternn(x)中的字符数ref_seqn(y)中的字符数seed_patternk中通配符的数量seed_pattern我没能在 R 中找到混合整数二次求解器(免费),所以这里是启发式使用DEoptimR:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
aminoSummary <- function(x){
f <- factor(strsplit(x, split = "")[[1]], levels = AADict)
list(
l = nchar(x),
k = sum(is.na(f)),
z = table(f)
)
}
x <- aminoSummary(ref_seq)
y <- aminoSummary(seed_pattern)
M <- length(AADict)
res <- DEoptimR::JDEoptim(
lower = rep(0, M),
upper = rep(y$k, M) + 1,
fn = function(r, z, k){
r <- floor(r)
sum(r * r) - 2 * sum(z * r)
},
constr = function(r, z, k) sum(floor(r)) - k,
meq = 1,
z = as.vector(x$z * y$l / x$l - y$z),
k = y$k
)
rep(AADict, floor(res$par))
[1] "R" "Q" "Q" "G" "G" "G" "S" "S" "S" "Y"
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
148 次 |
| 最近记录: |