I have a large file where the first column is an ID and the remaining 1304 columns are genotypes, as shown below.
rsID sample1 sample2 sample3...sample1304
abcd aa bb nc nc
efgh nc nc nc nc
ijkl aa ab aa nc
I want to count the number of "nc" values in each row and write the result to an additional column, so that I get the following:
rsID sample1 sample2 sample3...sample1304 no_calls
abcd aa bb nc nc 2
efgh nc nc nc nc 4
ijkl aa ab aa nc 1
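The table function counts frequencies per column, not per row. If I reshaped the data so that I could use table, the file would need to look like this: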
abcd aa[sample1]
abcd bb[sample2]
abcd nc[sample3] ...
abcd nc[sample1304]
efgh nc[sample1]
efgh nc[sample2]
efgh nc[sample3] ...
efgh nc[sample1304]
With that format, I would get the following, which is what I want:
ID nc aa ab bb
abcd 2 1 0 1
efgh 4 0 0 0
Does anyone know a simple way to get frequencies by row? I am currently trying this, but it takes quite a long time to run:
rsids$Number_of_no_calls <- apply(rsids, 1, function(x) sum(x=="NC"))
You can use rowSums.
df$no_calls <- rowSums(df == "nc")
df
#  rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd      aa      bb      nc         nc        2
#2 efgh      nc      nc      nc         nc        4
#3 ijkl      aa      ab      aa         nc        1
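To reproduce this, here is a minimal sketch. The data frame is rebuilt from the four sample columns shown in the question; the construction with `data.frame()` and `stringsAsFactors = FALSE` is my assumption about how the file was read, not something stated in the question. The comparison with `apply()` uses lowercase "nc" to match the sample data.

```r
# Rebuild the sample data from the question (only the four columns shown).
# How the real file was read in is an assumption.
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# df == "nc" gives a logical matrix; rowSums() counts the TRUEs in each row.
no_calls <- rowSums(df == "nc")

# The apply() approach from the question yields the same counts,
# but calls the anonymous function once per row.
no_calls_apply <- apply(df, 1, function(x) sum(x == "nc"))
```

Both vectors should hold the same counts; the difference is that rowSums does the summing in a single vectorized call rather than looping over rows in R.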
Or, as MrFlick pointed out, to exclude the first column from the row sums, you can modify the approach slightly:
df$no_calls <- rowSums(df[-1] == "nc")
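The question also shows a per-row frequency table for every genotype, not just "nc". The same idea can probably be extended by calling rowSums once per genotype. This is only a sketch: the `genotypes` vector is an assumption based on the values that appear in the sample data, and the data frame is rebuilt from the four columns shown in the question.

```r
# Sample data from the question (only the four columns shown).
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# Genotype codes to count; assumed from the values seen in the sample.
genotypes <- c("nc", "aa", "ab", "bb")

# One rowSums() call per genotype, skipping the ID column with df[-1].
# sapply() binds the per-genotype vectors into a matrix with one
# column per genotype (for data with more than one row).
counts <- sapply(genotypes, function(g) rowSums(df[-1] == g))

# Attach the IDs to get one row of frequencies per rsID.
result <- data.frame(ID = df$rsID, counts)
```

This loops over the handful of genotype codes rather than over the rows, so the number of R-level function calls stays small even when the file has many rows.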
Regarding row names: they are not counted by rowSums, and you can run a simple test to demonstrate it:
rownames(df)[1] <- "nc" # name first row "nc"
rowSums(df == "nc") # compute the row sums
#nc  2  3
# 2  4  1   # still the same in first row