Creating a t-test table using lapply

Niu*_*ang 1 r extract lapply

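I want to run t-tests between two groups (treated and untreated, coded 1 and 0 respectively in the sample data below) for several different programs, all of which sit in the same data frame. In the sample data below, I want to produce a t-test between the 1/0 treatment groups for every variable (in the sample data: Age, Dollars, DiseaseCnt). I want to run these t-tests by program, not across the whole population. I already have the logic for generating the t-tests. However, I need help with the final step: extracting the appropriate pieces from the function and producing something easy to digest.

Ultimately, what I want is a table with the t-statistics, the p-values, the variable each t-test was run on, and the program the variable was tested in.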

DT<-data.frame(
               Treated=sample(0:1,1000,replace=T)
              ,Program=c('Program A','Program B','Program C','Program D')
              ,Age=as.integer(rnorm(1000,mean=65,sd=15))
              ,Dollars=as.integer(rpois(1000,lambda=1000))
              ,DiseaseCnt=as.integer(rnorm(1000,mean=5,sd=2)) )

progs<-unique(DT$Program) # Pull program names
vars<-names(DT)[3:5] # pull variables to run t tests

test<-lapply(progs, function(i)
          tt<-lapply(vars, function(j) {t.test( DT[DT$Treated==1 & DT$Program == i,names(DT)==j] 
                                                ,DT[DT$Treated==0 & DT$Program == i,names(DT)==j]
                                                ,alternative = 'two.sided'  ) 
              list(j,tt$statistic,tt$p.value)  }
                 ) ) 
  # nested lapply produces results in list format that can be binded, but complete output w/ both lapply's is erroneous

Chr*_*son 5

You should convert it to a data.table first. (In my code, I call your original data frame DF):

library(data.table)
DT <- as.data.table(DF)
DT[, t.test(data=.SD, Age ~ Treated), by=Program]
   Program  statistic parameter   p.value   conf.int estimate null.value alternative
1: Program A -0.6286875  247.8390 0.5301326 -4.8110579 65.26667          0   two.sided
2: Program A -0.6286875  247.8390 0.5301326  2.4828527 66.43077          0   two.sided
3: Program B  1.4758524  230.5380 0.1413480 -0.9069634 67.15315          0   two.sided
4: Program B  1.4758524  230.5380 0.1413480  6.3211834 64.44604          0   two.sided
5: Program C  0.1994182  246.9302 0.8420998 -3.3560930 63.56557          0   two.sided
6: Program C  0.1994182  246.9302 0.8420998  4.1122406 63.18750          0   two.sided
7: Program D -1.1321569  246.0086 0.2586708 -6.1855837 62.31707          0   two.sided
8: Program D -1.1321569  246.0086 0.2586708  1.6701237 64.57480          0   two.sided
                method      data.name
1: Welch Two Sample t-test Age by Treated
2: Welch Two Sample t-test Age by Treated
3: Welch Two Sample t-test Age by Treated
4: Welch Two Sample t-test Age by Treated
5: Welch Two Sample t-test Age by Treated
6: Welch Two Sample t-test Age by Treated
7: Welch Two Sample t-test Age by Treated
8: Welch Two Sample t-test Age by Treated

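In this format, each Program gets two rows. Both rows carry the same statistic (the t value) and the same parameter (here, the degrees of freedom). For conf.int, the two rows are, in order, the lower and then the upper bound (so for Program A, the confidence interval is (-4.8110579, 2.4828527)). For estimate, the two rows are group 0 and then group 1 (so for Program A, the mean for Treated == 0 is 65.26667, and so on).

This is the quickest solution I could think of. You can loop over vars, or there may be a simpler way.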
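If the two-rows-per-program layout is awkward, one possible alternative is to pull the pieces out of the t.test result by name so that each program ends up on a single row. The following is only a sketch under a few assumptions: the seed, the output column names (t, df, p.value, ci_low, ci_high, mean_0, mean_1) and the object name res are mine and not part of the question, and the groups are passed to t.test as two vectors (Treated == 0 first) rather than through the formula interface.

```r
library(data.table)

set.seed(1)  # assumption: a seed is added only to make the sketch reproducible

# Sample data from the question, kept under the name DF as in the answer
DF <- data.frame(
  Treated    = sample(0:1, 1000, replace = TRUE),
  Program    = c("Program A", "Program B", "Program C", "Program D"),
  Age        = as.integer(rnorm(1000, mean = 65, sd = 15)),
  Dollars    = as.integer(rpois(1000, lambda = 1000)),
  DiseaseCnt = as.integer(rnorm(1000, mean = 5, sd = 2))
)
DT <- as.data.table(DF)

# One Welch t-test per program; each named element becomes its own column
res <- DT[, {
  tt <- t.test(Age[Treated == 0], Age[Treated == 1])
  list(
    t       = unname(tt$statistic),
    df      = unname(tt$parameter),
    p.value = tt$p.value,
    ci_low  = tt$conf.int[1],
    ci_high = tt$conf.int[2],
    mean_0  = tt$estimate[[1]],  # mean of Age where Treated == 0
    mean_1  = tt$estimate[[2]]   # mean of Age where Treated == 1
  )
}, by = Program]

print(res)
```

Putting the Treated == 0 vector first keeps the sign of t in line with the Age ~ Treated formula call, which also compares group 0 against group 1.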


Edit: I confirmed this for Program A and Age only, using the following code:

DT[Program == 'Program A', t.test(Age ~ Treated)]
    Welch Two Sample t-test

data:  Age by Treated
t = -0.62869, df = 247.84, p-value = 0.5301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.811058  2.482853
sample estimates:
mean in group 0 mean in group 1
       65.26667        66.43077

Edit 2: Here is code that loops over your variables and rbinds the results together:

do.call(rbind, lapply(vars, function(x) DT[, t.test(data=.SD, eval(parse(text=x)) ~ Treated), by=Program]))
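For comparison, the table asked for in the question (program, variable, t-statistic, p-value) can also be sketched in base R by keeping the nested lapply and returning a one-row data frame from the inner function. This is a minimal sketch, not the code above rewritten: the seed, the output column names and the object name results are assumptions added for illustration.

```r
set.seed(1)  # assumption: a seed is added only to make the sketch reproducible

# Sample data from the question
DF <- data.frame(
  Treated    = sample(0:1, 1000, replace = TRUE),
  Program    = c("Program A", "Program B", "Program C", "Program D"),
  Age        = as.integer(rnorm(1000, mean = 65, sd = 15)),
  Dollars    = as.integer(rpois(1000, lambda = 1000)),
  DiseaseCnt = as.integer(rnorm(1000, mean = 5, sd = 2))
)

progs <- unique(DF$Program)               # program names
vars  <- c("Age", "Dollars", "DiseaseCnt")  # variables to test

# Outer loop over programs, inner loop over variables;
# each inner call returns one row, and rbind stacks them.
results <- do.call(rbind, lapply(progs, function(p) {
  sub <- DF[DF$Program == p, ]
  do.call(rbind, lapply(vars, function(v) {
    tt <- t.test(sub[[v]][sub$Treated == 0],
                 sub[[v]][sub$Treated == 1],
                 alternative = "two.sided")
    data.frame(Program  = p,
               Variable = v,
               t        = unname(tt$statistic),
               p.value  = tt$p.value)
  }))
}))

print(results)
```

With four programs and three variables, results has twelve rows, one per program and variable pair.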