Spreading multiple coordinated columns

Ste*_*ner 2 r

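I have reasonably tidy data with sample, gene, allele, and frequency in separate columns. For each gene in each sample, I need to spread the alleles and their corresponding frequencies into separate columns. Below is what I have and what I need.

I'm trying to do this with dplyr/tidyr, but I'll take any solution I can get.

What I have: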

data.frame(sample=rep("sample1", 10), 
           gene=rep(paste0("gene", 1:5), each=2), 
           allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
           freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

#     sample  gene allele freq
# 1  sample1 gene1      A  0.9
# 2  sample1 gene1      G  0.1
# 3  sample1 gene2      A  0.8
# 4  sample1 gene2      C  0.2
# 5  sample1 gene3      A  0.7
# 6  sample1 gene3      T  0.3
# 7  sample1 gene4      C  0.6
# 8  sample1 gene4      G  0.4
# 9  sample1 gene5      G  0.5
# 10 sample1 gene5      T  0.5

What I want:

data.frame(sample=rep("sample1", 5), 
           gene=paste0("gene", 1:5), 
           allele1=c("A", "A", "A", "C", "G"), 
           allele2=c("G", "C", "T", "G", "T"), 
           freq1=c(.9, .8, .7, .6, .5), 
           freq2=c(.1, .2, .3, .4, .5))

#    sample  gene allele1 allele2 freq1 freq2
# 1 sample1 gene1       A       G   0.9   0.1
# 2 sample1 gene2       A       C   0.8   0.2
# 3 sample1 gene3       A       T   0.7   0.3
# 4 sample1 gene4       C       G   0.6   0.4
# 5 sample1 gene5       G       T   0.5   0.5

akr*_*run 7

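You can use dcast from the devel version of data.table (i.e. v1.9.5+), which can take multiple value.var columns. We create a sequence column ('indx') grouped by 'sample' and 'gene', then dcast from long to wide format, naming both value.var columns.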

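 library(data.table)  # v1.9.5+
 # df is the data frame from the question; number the rows within each sample/gene group
 setDT(df)[, indx := seq_len(.N), by = .(sample, gene)]
 # cast from long to wide, spreading both value columns at once
 dcast(df, sample + gene ~ indx, value.var = c('allele', 'freq'), sep = '')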
 #    sample  gene   allele1 allele2  freq1 freq2
 #1: sample1 gene1        A        G    0.9    0.1
 #2: sample1 gene2        A        C    0.8    0.2
 #3: sample1 gene3        A        T    0.7    0.3
 #4: sample1 gene4        C        G    0.6    0.4
 #5: sample1 gene5        G        T    0.5    0.5

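Note: instructions for installing the devel version are here

The sep='' argument is useful for creating column names such as "allele1", "allele2", etc. The default gives "allele_1", "allele_2", etc. (from @Arun's comment)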
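To make the effect of sep concrete, here is a minimal sketch comparing the column names produced with and without it. It assumes the question's data is stored in a data frame called df (a name the question never assigns) and that the installed data.table supports multiple value.var columns; the names dt, wide_default and wide_nosep are only for illustration.

```r
library(data.table)

# The question's sample data, assigned to `df` (assumed name)
df <- data.frame(sample = rep("sample1", 10),
                 gene   = rep(paste0("gene", 1:5), each = 2),
                 allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                 freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

# Work on a data.table copy and number the rows within each sample/gene group
dt <- as.data.table(df)
dt[, indx := seq_len(.N), by = .(sample, gene)]

# Default separator: value column name and index are joined with "_"
wide_default <- dcast(dt, sample + gene ~ indx,
                      value.var = c("allele", "freq"))

# Empty separator: value column name and index are joined directly
wide_nosep <- dcast(dt, sample + gene ~ indx,
                    value.var = c("allele", "freq"), sep = "")

names(wide_default)
names(wide_nosep)
```

Using as.data.table() here instead of setDT() leaves the original df untouched, which may be preferable if the data frame is reused later.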


jen*_*yan 5

This uses summarise rather than a true reshape, but it may do the job.

library(dplyr)
foo <- data.frame(sample=rep("sample1", 10), 
                  gene=rep(paste0("gene", 1:5), each=2), 
                  allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
                  freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

foo %>%
  group_by(sample, gene) %>% 
  summarise(allele1 = first(allele), allele2 = last(allele),
            freq1 = first(freq), freq2 = last(freq))

## Source: local data frame [5 x 6]
## Groups: sample
## 
##    sample  gene allele1 allele2 freq1 freq2
## 1 sample1 gene1       A       G   0.9   0.1
## 2 sample1 gene2       A       C   0.8   0.2
## 3 sample1 gene3       A       T   0.7   0.3
## 4 sample1 gene4       C       G   0.6   0.4
## 5 sample1 gene5       G       T   0.5   0.5
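Note that first() and last() only cover the case of exactly two rows per sample/gene group; with three or more alleles the middle ones would be dropped. If a true reshape within the tidyverse is wanted, a rough sketch along the following lines might work. It is not part of the answer above: it assumes tidyr 1.0.0 or later (for pivot_wider) and borrows the per-group index idea from the data.table answer; the names indx and wide are illustrative.

```r
library(dplyr)
library(tidyr)

foo <- data.frame(sample = rep("sample1", 10),
                  gene   = rep(paste0("gene", 1:5), each = 2),
                  allele = c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"),
                  freq   = c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

wide <- foo %>%
  group_by(sample, gene) %>%
  mutate(indx = row_number()) %>%   # 1, 2, ... within each sample/gene group
  ungroup() %>%
  pivot_wider(names_from = indx,
              values_from = c(allele, freq),
              names_sep = "")       # allele1, allele2, ... instead of allele_1, ...

wide
```

Groups with fewer rows than the largest group would get NA in the extra columns.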