Ahd*_*dee 8 r dplyr data.table
I have the following data frame.
temp = structure(list(A = c(0, 0, 0, 3.72900887033786, 1.94860084749336,
0), C = c(0, 0, 0, 3.44095219802964, 2.35049724708413, 0.0285691521967709
), A = c(0, 0, 0, 3.29572302453997, 0.933572638261024, 0), D = c(0,
0, 0, 2.4905701304462, 1.54101915313356, 0), E = c(0, 0, 0, 4.23189316164533,
1.7311832415722, 0), E = c(0, 0, 0, 4.37851162325373, 2.50080205305716,
0), D = c(0, 0, 0, 3.68929916053589, 2.4905701304462, 0.189033824390017
), F = c(0, 2.27500704749987, 0, 3.68032435684402, 1.77820857639809,
0), A = c(0, 0, 0, 3.5668151540109, 1.72683121703249, 0.0285691521967709
), G = c(0, 0, 0, 5.6450098843911, 3.09929520433778, 0)), row.names = c("5_8S_rRNA",
"5S_rRNA", "7SK", "A1BG", "A1BG-AS1", "A1CF"), class = "data.frame")
It looks like this.
A C A D E E D F A G
5_8S_rRNA 0.000000 0.00000000 0.0000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.00000000 0.000000
5S_rRNA 0.000000 0.00000000 0.0000000 0.000000 0.000000 0.000000 0.0000000 2.275007 0.00000000 0.000000
7SK 0.000000 0.00000000 0.0000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.00000000 0.000000
A1BG 3.729009 3.44095220 3.2957230 2.490570 4.231893 4.378512 3.6892992 3.680324 3.56681515 5.645010
A1BG-AS1 1.948601 2.35049725 0.9335726 1.541019 1.731183 2.500802 2.4905701 1.778209 1.72683122 3.099295
A1CF 0.000000 0.02856915 0.0000000 0.000000 0.000000 0.000000 0.1890338 0.000000 0.02856915 0.000000
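What I want to do is collapse any duplicated columns by averaging them, and I want the average computed separately for each row.
Ideally the resulting data frame would have the same number of rows but only the columns A, C, D, E, F and G. Is this possible?
We can use split.default to split the data frame by column name, then loop over the resulting list and apply rowMeans to each group of columns: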
sapply(split.default(temp, names(temp)), rowMeans)
A C D E F G
5_8S_rRNA 0.000000000 0.00000000 0.00000000 0.000000 0.000000 0.000000
5S_rRNA 0.000000000 0.00000000 0.00000000 0.000000 2.275007 0.000000
7SK 0.000000000 0.00000000 0.00000000 0.000000 0.000000 0.000000
A1BG 3.530515683 3.44095220 3.08993465 4.305202 3.680324 5.645010
A1BG-AS1 1.536334901 2.35049725 2.01579464 2.115993 1.778209 3.099295
A1CF 0.009523051 0.02856915 0.09451691 0.000000 0.000000 0.000000
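To make the intermediate steps visible, here is a minimal sketch on a small hypothetical data frame (`df`, `groups`, `res` and `res_df` are illustrative names, not part of the question). It assumes at least two rows, since with a single row `sapply` would simplify to a vector rather than a matrix.

```r
# Hypothetical toy data with a duplicated column name (not the question's data)
df <- data.frame(A = c(1, 2), B = c(10, 20), A = c(3, 6),
                 check.names = FALSE, row.names = c("g1", "g2"))

# split.default treats the data frame as a list of columns and
# groups those columns by the supplied names
groups <- split.default(df, names(df))
names(groups)    # one element per distinct column name
ncol(groups$A)   # both "A" columns end up in the same group

# rowMeans averages each group row by row; sapply binds the results into a matrix
res <- sapply(groups, rowMeans)

# sapply returns a matrix; convert it if a data frame is needed
res_df <- as.data.frame(res)
```

The `as.data.frame` step is only needed when later code expects a data frame instead of a matrix.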
Another base R solution, using rowsum:
t(rowsum(t(temp), names(temp)) / c(table(names(temp))))
A C D E F G
5_8S_rRNA 0.000000000 0.00000000 0.00000000 0.000000 0.000000 0.000000
5S_rRNA 0.000000000 0.00000000 0.00000000 0.000000 2.275007 0.000000
7SK 0.000000000 0.00000000 0.00000000 0.000000 0.000000 0.000000
A1BG 3.530515683 3.44095220 3.08993465 4.305202 3.680324 5.645010
A1BG-AS1 1.536334901 2.35049725 2.01579464 2.115993 1.778209 3.099295
A1CF 0.009523051 0.02856915 0.09451691 0.000000 0.000000 0.000000
Here is a base R solution:
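The rowsum approach relies on two things: rowsum returns one row per distinct name, and dividing a matrix by a vector recycles that vector down the rows. A small sketch on hypothetical data (`m`, `sums`, `counts` and `means` are illustrative names) spells this out. Indexing the counts by `rownames(sums)` is an extra precaution added here so that the two orderings are guaranteed to match.

```r
# Hypothetical toy matrix: rows are genes, columns carry duplicated names
m <- matrix(c(1, 2, 10, 20, 3, 6), nrow = 2,
            dimnames = list(c("g1", "g2"), c("A", "B", "A")))

# After transposing, each original column is a row, so rowsum adds up
# all rows that share a name
sums <- rowsum(t(m), colnames(m))

# Number of columns per name, reordered to match the rows of `sums`
counts <- c(table(colnames(m)))[rownames(sums)]

# Dividing recycles `counts` down the rows: row i is divided by counts[i]
means <- t(sums / counts)
```

Here `means` has one row per gene and one column per distinct name, which is the same layout as the result shown above.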
t(do.call(rbind, by(t(temp), row.names(t(temp)), FUN = colMeans)))
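Or using the tidyverse:
Here is a tidyverse option. We iterate over the unique column names, select the matching columns for each name, name the resulting list of data frames, and then apply rowMeans to each one. However, we have to use setNames inside the first map call to make the column names unique, because the tidyverse does not like duplicate column names. The row names are also dropped along the way, so we add them back at the end.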
library(tidyverse)

map(.x = unique(names(temp)),
    ~ select(setNames(temp, make.names(names(temp), unique = TRUE)), starts_with(.x))) %>%
  set_names(unique(names(temp))) %>%
  map_dfc(., rowMeans) %>%
  as.data.frame() %>%
  `rownames<-`(row.names(temp))
Or another base R solution:
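One caveat with the approach above: starts_with matches by prefix, so if one column name were a prefix of another (say A and AB), the A group would also pick up the AB column. That does not happen with the names in the question. The following is a hedged sketch of a variant that matches names exactly. It uses hypothetical data and illustrative names (`df`, `nms`, `res`) and is not the original answer's code.

```r
library(purrr)

# Hypothetical toy data in which "A" is a prefix of "AB" (not the question's data)
df <- data.frame(A = c(1, 2), AB = c(10, 20), A = c(3, 6),
                 check.names = FALSE, row.names = c("g1", "g2"))

nms <- unique(names(df))

# For each distinct name, keep only the columns whose name matches exactly,
# then average them row by row; set_names(nms) names the resulting list
res <- map(set_names(nms), ~ rowMeans(df[names(df) == .x]))

# Bind the per-name vectors into a data frame and restore the row names
res <- as.data.frame(res, row.names = row.names(df))
```

Because the columns are selected with base R logical indexing, the duplicated names never have to be made unique first.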
temp2 <- t(temp)
t(tapply(temp2, list(row.names(temp2)[row(temp2)], colnames(temp2)[col(temp2)]), FUN = mean))
Output
A C D E F G
5_8S_rRNA 0.00000000 0.0000000 0.0000000 0.00000 0.00000 0.00000
5S_rRNA 0.00000000 0.0000000 0.0000000 0.00000 2.27501 0.00000
7SK 0.00000000 0.0000000 0.0000000 0.00000 0.00000 0.00000
A1BG 3.53051568 3.4409522 3.0899346 4.30520 3.68032 5.64501
A1BG-AS1 1.53633490 2.3504972 2.0157946 2.11599 1.77821 3.09930
A1CF 0.00952305 0.0285692 0.0945169 0.00000 0.00000 0.00000
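As a sanity check, the base R approaches can be compared against each other. This is a sketch on a small hypothetical data frame (`df` and the `by_*` names are illustrative). It ignores attributes such as dimnames when comparing, and its toy row names are already in sorted order, which matters because tapply orders its result by the sorted group labels.

```r
# Hypothetical toy data with a duplicated column name (not the question's data)
df <- data.frame(A = c(1, 2, 0), B = c(10, 20, 5), A = c(3, 6, 0),
                 check.names = FALSE, row.names = c("g1", "g2", "g3"))

# split.default + rowMeans
by_split <- sapply(split.default(df, names(df)), rowMeans)

# rowsum divided by the number of columns per name
by_rowsum <- t(rowsum(t(df), names(df)) / c(table(names(df))))

# tapply over every cell, grouped by (column name, row name)
m <- t(df)
by_tapply <- t(tapply(m, list(rownames(m)[row(m)], colnames(m)[col(m)]), FUN = mean))

# Each comparison should report TRUE if the approaches agree
all.equal(by_split, by_rowsum, check.attributes = FALSE)
all.equal(by_split, by_tapply, check.attributes = FALSE)
```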