R中的代码更快

Kev*_*vin 6 simulation for-loop r vectorization

仅供参考:自从我的第一版以来,我对此进行了大量编辑.这种模拟已经从14小时减少到14分钟.

我是编程的新手,但我做了一个模拟试图遵循有机体中的无性复制并量化母体和子体生物之间染色体数量的差异.模拟运行速度非常慢.完成大约需要6个小时.我想知道什么是使模拟运行更快的最佳方法.

这些数字生物有x个染色体.与大多数生物体不同,染色体彼此独立,因此它们具有转移到子体生物体中的相同机会.

在这种情况下,染色体进入子细胞的分布遵循二项分布,概率为0.5.

该函数sim_repo采用具有已知数量染色体的数字生物矩阵,并使它们经历12代复制.它复制这些染色体,然后使用该rbinom函数随机生成一个数字.然后将该数字分配给子单元格.由于在无性繁殖期间没有染色体丢失,另一个子细胞接收剩余的染色体.然后对G代进行重复.然后从矩阵中的每一行采样单个值.

 sim_repo = function( x1, G=12, k=1, t=25, h=1000 ) {

            # x1 is the list of copy numbers for a somatic chromosome
            # G is the number of generations, default is 12
            # k is the transfer size, default is 1
            # t is the number of transfers, default is 25
            # h is the number of times to replicate, default is 1000

            dup <- x1 * 2 # duplicate the initial somatic chromosome copy number for replication
            pop <- 1 # set generation time
            set.seed(11)
            z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic chromosome is distributed to one of the daughter cells
            z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
            x1 <- cbind(z, z1) # put both in a matrix

            for ( pop in 1:G ) { # this loop does the replication for each cell in each generation
                pop <- 1 + pop # number of generations.  This is a count for the for loop
                dup <- x1 * 2 # double the somatic chromosomes for replication
                set.seed(11)
                z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic c hromosomes distributed to one of the daughter cells
                z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
                x1 <- cbind(z, z1) # put both in a matrix
                }

            # the following for loop randomly selects one cell in the population that was created
            # the output is a matrix of 1 column
            x1 <- matrix(apply(x1, 1, sample, size=k), ncol=1)
            x1
    }
Run Code Online (Sandbox Code Playgroud)

我的研究我感兴趣的是初始祖先生物的染色体变化和这个模拟的最后时间点.以下功能表示将细胞转移到新的生活环境中.它接受函数sim_rep 的输出并使用它来生成更多代.然后,它在第一个和最后一个矩阵列中查找行之间的差异,并找出它们之间的差异.

    # The following function is mostly the same as I talked about in the description.
    # The only difference is I changed some aspects to take into account I am using
    # matrices and not lists.
    # The function outputs the difference between the intial variance component between
    # 'cell lines' with the final variance after t number of transfers

sim_exp = function( x1, G=12, k=1, t=25, h=1000 ) {

    xn <- matrix(NA, nrow(x1), t)  
    x <- x1
    xn[,1] <- x1
    for ( l in 2:t ) {
        x <- sim_repo( x, G, k, t, h )
        xn[, l] <- x
    }

    colvar <- matrix(apply(xn,2,var),ncol=ncol(xn))
    ivar <- colvar[,1]
    fvar <- colvar[,ncol(xn)]
    deltavar <- fvar - ivar
    deltavar
}  
Run Code Online (Sandbox Code Playgroud)

我需要复制这个模拟^ h的次数.因此,我提出了以下的功能,将调用该函数sim_exp ^ h的次数.

sim_1000 = function( x1, G=12, k=1, t=25, h=1000 ) {
    xn <- vector(length=h)
    for ( l in 2:h ) {
        x <- sim_exp( x1, G, k, t, h )
        xn[l] <- x
    }
        xn
}
Run Code Online (Sandbox Code Playgroud)

当我使用类似6值的值调用sim_exp函数时,需要大约52秒才能完成.

 x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1)
 system.time(sim_1000(x1,h=1))
   user  system elapsed 
  1.280   0.105   1.369 
Run Code Online (Sandbox Code Playgroud)

如果我能更快地完成它,那么我可以完成更多这些模拟并在模拟上应用选择模型.

我的输入将如下所示x1,一个矩阵与每个祖先生物体在自己的行上:

x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1) # a matrix of 6 organisms
Run Code Online (Sandbox Code Playgroud)

当我跑:

a <- sim_repo(x1, G=12, k=1)
Run Code Online (Sandbox Code Playgroud)

我的预期输出将是:

 a
     [,1]
[1,]  137
[2,]   82
[3,]   89
[4,]  135
[5,]   89
[6,]  109

 system.time(sim_repo(x1))
   user  system elapsed 
  1.969   0.059   2.010 
Run Code Online (Sandbox Code Playgroud)

当我调用sim_exp函数时,

b < - sim_exp(x1,G = 12,k = 1,t = 25)

它调用sim_repo函数G次并输出:

 b
[1] 18805.47
Run Code Online (Sandbox Code Playgroud)

当我调用该sim_1000函数时,我通常将h设置为1000,但在这里我将其设置为2.所以这里sim_1000将调用sim_exp并复制它2次.

c <- sim_1000(x1, G=12, k=1, t=25, h=2)
c
[1] 18805.47 18805.47
Run Code Online (Sandbox Code Playgroud)

jor*_*ran 8

正如评论中的其他人所提到的,如果我们只查看函数sim_repo,并替换该行:

dup <- apply(x1, c(1,2),"*",2)
Run Code Online (Sandbox Code Playgroud)

dup <- x1 * 2
Run Code Online (Sandbox Code Playgroud)

线条

z <- apply(dup,c(1,2),rbinom,n=1,prob=0.5)
Run Code Online (Sandbox Code Playgroud)

z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup))
Run Code Online (Sandbox Code Playgroud)

和内部for循环

x1 <- matrix(apply(x1,1,sample,size = 1), ncol=1)
Run Code Online (Sandbox Code Playgroud)

我得到了一个很大的速度提升:

system.time(sim_exp(x1))
   user  system elapsed 
  0.655   0.017   0.686 
> system.time(sim_expOld(x1))
   user  system elapsed 
 21.445   0.128  21.530 
Run Code Online (Sandbox Code Playgroud)

我确认它正在做同样的事情:

set.seed(123)
out1 <- sim_exp(x1)

set.seed(123)
out2 <- sim_expOld(x1)

all.equal(out1,out2)
> TRUE
Run Code Online (Sandbox Code Playgroud)

而且,这甚至不会深入研究预分配,考虑到您构建代码的方式,在没有完全重新设计的情况下实际上很难实现.

而且甚至还没有开始考虑你是否真的需要所有这三个功能......

  • 我想感谢你的帮助.我的模拟现在在大约13分钟时运行1000次重复.这是一个很好的教训,因为某些东西有效,并不意味着它是有效的.我现在可以运行很多模拟. (2认同)