mar*_*ess 1 r dataframe dplyr data.table
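I need to subtract a group-wise mean from every observation within its group. The challenging part is that my data frame has a sub-grouping, i.e. two grouping levels: V5 and V4. Here is an example of my data.frame structure: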
B = as.data.frame(matrix(
c(2,2,3,3,4,3,1,5,7,6,4,5,8,9,2,3,8,4,5,0,7,5,6,7,5,3,2,
"A","A","A","A","B","B","C","C","C",
"TRUE","TRUE","TRUE","TRUE","FALSE","FALSE","FALSE","FALSE","FALSE"),
nrow=9,ncol=5))
So my data.frame B looks like this:
V1 V2 V3 V4 V5
1 2 6 5 A TRUE
2 2 4 0 A TRUE
3 3 5 7 A TRUE
4 3 8 5 A TRUE
5 4 9 6 B FALSE
6 3 2 7 B FALSE
7 1 3 5 C FALSE
8 5 8 3 C FALSE
9 7 4 2 C FALSE
If I take the mean by V5 and V4, I get a new data.frame, which I call test, that accounts for the two-level grouping:
test <- aggregate(. ~ B$V5+B$V4,data=B, mean)
> test
B$V5 B$V4 V1 V2 V3 V4 V5
1 TRUE A 2.5 4.500000 3.75 1 2
2 FALSE B 3.5 4.000000 5.50 2 1
3 FALSE C 4.0 3.666667 3.00 3 1
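What I am struggling with is subtracting the two-level group means in data.frame test from the original observations in data.frame B. Intuitively, I think an apply() function with some kind of conditional statement might do it, but that is fairly advanced coding for me, and I am still learning R.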
Here is a base R solution:
B <- data.frame(matrix(c(2,2,3,3,4,3,1,5,7,6,4,5,8,9,2,3,8,4,5,0,7,5,6,7,5,3,2), 9),
V4=c("A","A","A","A","B","B","C","C","C"),
V5=c("TRUE","TRUE","TRUE","TRUE","FALSE","FALSE","FALSE","FALSE","FALSE"))
# ave() returns each row's group mean (groups defined by V4 and V5); subtract it column by column
B[1:3] <- lapply(B[1:3], function(x) x - ave(x, B$V4, B$V5, FUN=mean))
B
I used different data. In your example data frame, all columns are factors, and you cannot do calculations such as mean(...) on factors.
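The point about factors can be illustrated with the data frame from the question. This is a minimal sketch, not part of the original answer: it assumes the first three columns hold numbers stored as factors (or as character strings, which is what R 4.0 and later produce by default), and the name `B_num` is only illustrative.

```r
# Rebuild the frame from the question: mixing numbers and strings in c()
# yields a character matrix, so no column is numeric.
B_num <- as.data.frame(matrix(
  c(2,2,3,3,4,3,1,5,7,6,4,5,8,9,2,3,8,4,5,0,7,5,6,7,5,3,2,
    "A","A","A","A","B","B","C","C","C",
    "TRUE","TRUE","TRUE","TRUE","FALSE","FALSE","FALSE","FALSE","FALSE"),
  nrow = 9, ncol = 5))

# Convert via as.character() first: as.numeric() applied directly to a
# factor returns the internal level codes, not the printed values.
B_num[1:3] <- lapply(B_num[1:3], function(x) as.numeric(as.character(x)))

str(B_num[1:3])
```

After this conversion, the `ave()` line above should work on `B_num` in the same way.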
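We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(B), group by 'V4' and 'V5', loop through the Subset of Data.table (.SD), and take the difference between each column and that column's mean within each group.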
library(data.table)
setDT(B)[, lapply(.SD, function(x) x- mean(x)), by = .(V4, V5)]
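The call above returns a new table with the grouping columns moved to the front. If you would rather update the numeric columns in place, a possible variant uses `:=` together with `.SDcols`. This is a sketch under the assumption that V1 to V3 are numeric; the names `DT` and `cols` are illustrative.

```r
library(data.table)

# Same numeric data as in the answer, built directly as a data.table
DT <- data.table(
  V1 = c(2, 2, 3, 3, 4, 3, 1, 5, 7),
  V2 = c(6, 4, 5, 8, 9, 2, 3, 8, 4),
  V3 = c(5, 0, 7, 5, 6, 7, 5, 3, 2),
  V4 = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  V5 = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE")
)

cols <- c("V1", "V2", "V3")  # columns to centre (assumed numeric)

# := overwrites the listed columns by reference; .SDcols restricts .SD to them
DT[, (cols) := lapply(.SD, function(x) x - mean(x)), by = .(V4, V5), .SDcols = cols]

DT
```

Because `:=` modifies the table by reference, the original row and column order is kept and no copy is made.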
Or we can use dplyr:
library(dplyr)
B %>%
group_by(V4, V5) %>%
mutate_all(funs(.- mean(.)))
# A tibble: 9 x 5
# Groups: V4, V5 [3]
# V1 V2 V3 V4 V5
# <dbl> <dbl> <dbl> <fctr> <fctr>
#1 -0.5000000 0.25 0.7500000 A TRUE
#2 -0.5000000 -1.75 -4.2500000 A TRUE
#3 0.5000000 -0.75 2.7500000 A TRUE
#4 0.5000000 2.25 0.7500000 A TRUE
#5 0.5000000 3.50 -0.5000000 B FALSE
#6 -0.5000000 -3.50 0.5000000 B FALSE
#7 -3.3333333 -2.00 1.6666667 C FALSE
#8 0.6666667 3.00 -0.3333333 C FALSE
#9 2.6666667 -1.00 -1.3333333 C FALSE
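Note that `funs()` has been deprecated in later dplyr releases. With dplyr 1.0.0 or newer, the same grouped centring can be sketched with `across()`. This is not the original answer's code; the names `B2` and `centered` are illustrative, and V1 to V3 are assumed to be numeric.

```r
library(dplyr)

# Same numeric data as in the answer
B2 <- data.frame(
  V1 = c(2, 2, 3, 3, 4, 3, 1, 5, 7),
  V2 = c(6, 4, 5, 8, 9, 2, 3, 8, 4),
  V3 = c(5, 0, 7, 5, 6, 7, 5, 3, 2),
  V4 = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  V5 = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE")
)

centered <- B2 %>%
  group_by(V4, V5) %>%
  # subtract the group mean from V1, V2 and V3 within each (V4, V5) group
  mutate(across(V1:V3, ~ .x - mean(.x))) %>%
  ungroup()

centered
```

Calling `ungroup()` at the end is optional; it simply avoids carrying the grouping into later steps.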
The data, assuming the first 3 columns are numeric:
B <- structure(list(V1 = c(2, 2, 3, 3, 4, 3, 1, 5, 7), V2 = c(6, 4,
5, 8, 9, 2, 3, 8, 4), V3 = c(5, 0, 7, 5, 6, 7, 5, 3, 2), V4 = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V5 = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("FALSE",
"TRUE"), class = "factor")), .Names = c("V1", "V2", "V3",
"V4", "V5"), row.names = c(NA, -9L), class = "data.frame")