Converting a frequency data frame to a wider format

Question

I have a data frame that looks like this.

input dataframe

position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2

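These data are nucleotide differences at given positions in a hypothetical genome. mean_freq is the frequency of the alternative nucleotide relative to the reference, so the first row says the proportion of C's is 0.002, which implies that A is at 0.998.

I would like to convert this to a different structure by creating new columns, one per nucleotide: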

desired_output

position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2

I tried this approach:

per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference 
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}

desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)

I got the following error:

In matrix(value, n, p) :
  data length [8905] is not a sub-multiple or multiple of the number of columns

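Also, I feel there must be a more intuitive solution, possibly using tidyr or dplyr. How can I conveniently convert the input data frame to the desired output format?

Thanks.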

Answer 1

akr*_*run 4

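One option is to create a matrix of 0s with the column names "G", "C", "T", "A", use match to find the column that corresponds to each row's alternative and reference nucleotide, assign the values through a row/column index, and then cbind the result with the "position" and "sample_id" columns of the original dataset.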

# df1 is the input data frame from the question
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
# alternative nucleotide gets mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))]  <-  df1$mean_freq
# reference nucleotide gets the remainder, 1 - mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))]  <-  1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#   position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2

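Since the question asks about tidyr or dplyr, the same reshaping could be sketched roughly as follows. This is not part of the answer above, only an illustration under some assumptions: df1 is rebuilt here from the three rows shown in the question, the helper columns role, nt and freq are names made up for this sketch, and names_expand requires tidyr 1.2.0 or later.

```r
library(dplyr)
library(tidyr)

# Input data copied from the question (df1 is the name used in the answer)
df1 <- data.frame(
  position    = 1:3,
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # Two rows per position: one for the reference, one for the alternative
  pivot_longer(c(reference, alternative),
               names_to = "role", values_to = "nt") %>%
  mutate(
    # The alternative gets mean_freq, the reference gets the remainder
    freq = if_else(role == "alternative", mean_freq, 1 - mean_freq),
    # Fixed levels so that all four nucleotides can become columns
    nt = factor(nt, levels = c("G", "C", "T", "A"))
  ) %>%
  select(position, sample_id, nt, freq) %>%
  # One column per nucleotide; nucleotides absent at a position are filled with 0
  pivot_wider(names_from = nt, values_from = freq,
              values_fill = 0, names_expand = TRUE) %>%
  select(position, all_of(c("G", "C", "T", "A")), sample_id)
```

Turning nt into a factor with fixed levels, combined with names_expand = TRUE, is meant to keep all four nucleotide columns even when one of them never occurs in the data. The result is a tibble; wrap it in as.data.frame() if a plain data frame is needed.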
