Aggregating data frame columns

Car*_*rol 3 r dataframe

I have a data.frame as follows:

>data
    ID     Orginal   Modified
    Sam_1    M         K
    Sam_1    K         M
    Sam_1    I         J
    Sam_1    M         K
    Sam_1    K         M
    Sam_2    K         M
    Sam_2    M         K
    Sam_3    J         P
    Sam_4    K         M
    Sam_4    M         K
    Sam_4    P         J 

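For each sample, I want to count the number of times an "M" in the "Orginal" column becomes a "K" in the "Modified" column, and the number of times a "K" in the "Orginal" column becomes an "M" in the "Modified" column, and then report the result in a tab-delimited text file as follows: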

>newdata
    ID     M_to_K_counts  K_to_M_counts 
    Sam_1     2                2 
    Sam_2     1                1
    Sam_3     0                0
    Sam_4     1                1

I tried the following code, but it failed:

counts=function()
{
for(i in 1:dim(rnaseqmut)[1])
{
  mk_counts=0
  km_counts=0
  if(data$Original[i]=='M' & data$Modified[i]== 'K')
    {
       mk_counts=mk_counts+1
    }
  if(data$Original[i]=='K' & data$Modified[i]== 'M')
    {
       km_counts=km_counts+1
    }
}
print(mk_counts)
print(km_counts)
}

How can I get the output in the format I want?

akr*_*run 5

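One option is to use data.table. We convert the 'data.frame' to a 'data.table' (setDT(data)). Grouped by the 'ID' column, we take the sum of the logical vector that is TRUE where 'Orginal' is 'M' and 'Modified' is 'K' ('MtoKcount'); summing a logical vector counts its TRUE values. 'KtoMcount' is obtained the same way with the two letters reversed.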

library(data.table)
setDT(data)[, list(MtoKcount=sum(Orginal=='M' & Modified=='K'),
               KtoMcount = sum(Orginal=='K' & Modified=='M')), by =  ID]
#       ID MtoKcount KtoMcount
#1: Sam_1         2         2
#2: Sam_2         1         1
#3: Sam_3         0         0
#4: Sam_4         1         1
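The question also asks for a tab-delimited text file with the columns named M_to_K_counts and K_to_M_counts. A minimal sketch of that last step follows, assuming the same grouped sums as above; the output file name newdata.txt, the temporary directory, and the variable names newdata and out_file are placeholders chosen for illustration, not part of the original answer. It uses as.data.table() rather than setDT() so that the original data.frame is left unmodified.

```r
library(data.table)

# Same example data as in the question (the column really is spelled "Orginal").
data <- data.frame(
  ID = c("Sam_1", "Sam_1", "Sam_1", "Sam_1", "Sam_1",
         "Sam_2", "Sam_2", "Sam_3", "Sam_4", "Sam_4", "Sam_4"),
  Orginal  = c("M", "K", "I", "M", "K", "K", "M", "J", "K", "M", "P"),
  Modified = c("K", "M", "J", "K", "M", "M", "K", "P", "M", "K", "J"),
  stringsAsFactors = FALSE
)

# Count M -> K and K -> M transitions per ID, using the column names
# requested in the question.
newdata <- as.data.table(data)[, list(
  M_to_K_counts = sum(Orginal == "M" & Modified == "K"),
  K_to_M_counts = sum(Orginal == "K" & Modified == "M")
), by = ID]

# Hypothetical output location; replace with the path you actually want.
out_file <- file.path(tempdir(), "newdata.txt")

# sep = "\t" makes the file tab-delimited; quote = FALSE and
# row.names = FALSE keep it free of quotes and row numbers.
write.table(newdata, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The file can be read back with read.delim(out_file) to check its contents.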

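Another option is table from base R. We paste together the columns other than 'ID' (do.call(paste0, data[-1])) and pass the result, along with 'ID', to table to get the frequency counts. Then we subset the table output ('tbl') to keep only the columns named 'KM' and 'MK'.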

 tbl <- table(data$ID,do.call(paste0, data[-1]))[,c('KM', 'MK')]
 tbl
 #      KM MK
 #Sam_1  2  2
 #Sam_2  1  1
 #Sam_3  0  0
 #Sam_4  1  1

As @user295691 mentioned in the comments, we can set the column names at the same time as we paste.

  tbl <- with(data, table(ID, paste0(Orginal, "_to_", Modified,"_counts"))) 
  tbl[,c('K_to_M_counts', 'M_to_K_counts')]
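The table result keeps the IDs as row names rather than as a column. If a plain data frame with an explicit ID column is needed, as in the layout shown in the question, something along these lines may work. This is only a sketch: the names wanted, counts, and newdata are introduced here for illustration. Note also that selecting columns by name raises a "subscript out of bounds" error if one of the transitions never occurs in the data.

```r
# Same example data as in the question (the column really is spelled "Orginal").
data <- data.frame(
  ID = c("Sam_1", "Sam_1", "Sam_1", "Sam_1", "Sam_1",
         "Sam_2", "Sam_2", "Sam_3", "Sam_4", "Sam_4", "Sam_4"),
  Orginal  = c("M", "K", "I", "M", "K", "K", "M", "J", "K", "M", "P"),
  Modified = c("K", "M", "J", "K", "M", "M", "K", "P", "M", "K", "J"),
  stringsAsFactors = FALSE
)

# Cross-tabulate ID against the pasted transition label.
tbl <- with(data, table(ID, paste0(Orginal, "_to_", Modified, "_counts")))

# Columns to keep, in the order requested in the question.
wanted <- c("M_to_K_counts", "K_to_M_counts")

# unclass() turns the table into a plain integer matrix; drop = FALSE
# keeps it two-dimensional even if only one column were selected.
counts <- as.data.frame(unclass(tbl)[, wanted, drop = FALSE])

# Move the row names (the sample IDs) into a regular column.
newdata <- data.frame(ID = rownames(counts), counts,
                      row.names = NULL, stringsAsFactors = FALSE)
print(newdata)
```

The resulting newdata can then be written with write.table(newdata, file, sep = "\t", quote = FALSE, row.names = FALSE), where file is whatever path you choose.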

Data

data <- structure(list(ID = c("Sam_1", "Sam_1", "Sam_1", "Sam_1", 
"Sam_1", 
"Sam_2", "Sam_2", "Sam_3", "Sam_4", "Sam_4", "Sam_4"), Orginal = c("M", 
"K", "I", "M", "K", "K", "M", "J", "K", "M", "P"), Modified = c("K", 
"M", "J", "K", "M", "M", "K", "P", "M", "K", "J")), .Names = c("ID", 
"Orginal", "Modified"), class = "data.frame", row.names = c(NA, 
-11L))