Wor*_*e11 5 r dplyr data-science
Using the iris dataset, I'm trying to calculate a z-score for each variable. I have the data in tidy format after doing the following:
library(reshape2)
library(dplyr)
test <- iris
test <- melt(iris,id.vars = 'Species')
This gives me the following:
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
3 setosa Sepal.Length 4.7
4 setosa Sepal.Length 4.6
5 setosa Sepal.Length 5.0
6 setosa Sepal.Length 5.4
However, when I try to create a z-score column for each group (e.g. the z-scores for Sepal.Length should not be comparable to those for Sepal.Width) using the following:
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = (value - mean(value)) / sd(value))
The resulting z-scores have not been grouped and are based on all of the data.
What is the best way to return z-scores by group using dplyr?
Many thanks!
I believe you were overcomplicating things by using mean/sd. Just use the function scale.
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = scale(value))
test
## A tibble: 600 x 4
## Groups: Species, variable [12]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Sepal.Length 5.1 0.26667447
# 2 setosa Sepal.Length 4.9 -0.30071802
# 3 setosa Sepal.Length 4.7 -0.86811050
# 4 setosa Sepal.Length 4.6 -1.15180675
# 5 setosa Sepal.Length 5.0 -0.01702177
# 6 setosa Sepal.Length 5.4 1.11776320
# 7 setosa Sepal.Length 4.6 -1.15180675
# 8 setosa Sepal.Length 5.0 -0.01702177
# 9 setosa Sepal.Length 4.4 -1.71919923
#10 setosa Sepal.Length 4.9 -0.30071802
## ... with 590 more rows
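One detail worth knowing about the scale() approach: scale() returns a one-column matrix rather than a plain vector, so depending on the dplyr version the new column may be stored as a matrix column. The following is a minimal sketch, not part of the original answer, that wraps the call in as.numeric() to get an ordinary numeric column. It assumes the same melted `test` data frame as above.

```r
library(dplyr)
library(reshape2)

# Same long-format data as in the question.
test <- melt(iris, id.vars = "Species")

# scale() returns a one-column matrix carrying centring/scaling attributes.
is.matrix(scale(test$value))

# as.numeric() drops the matrix structure, so z_score is a plain double column.
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value))) %>%
  ungroup()

str(test$z_score)
```

The z-scores themselves are unchanged; only the storage type of the column differs.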
Edit.
Following the OP's comment, here is some code to get the Petal.Width rows that have a positive z_score.
i1 <- which(test$variable == "Petal.Width" & test$z_score > 0)
test[i1, ]
## A tibble: 61 x 4
## Groups: Species, variable [3]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Petal.Width 0.4 1.461300
# 2 setosa Petal.Width 0.3 0.512404
# 3 setosa Petal.Width 0.4 1.461300
# 4 setosa Petal.Width 0.4 1.461300
# 5 setosa Petal.Width 0.3 0.512404
# 6 setosa Petal.Width 0.3 0.512404
# 7 setosa Petal.Width 0.3 0.512404
# 8 setosa Petal.Width 0.4 1.461300
# 9 setosa Petal.Width 0.5 2.410197
#10 setosa Petal.Width 0.4 1.461300
## ... with 51 more rows
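The same subset can also be written with dplyr's filter() instead of which() and bracket indexing. This is only an alternative sketch under the same assumptions as above (the melted iris data with a per-group z_score column); the name `positive_petal_width` is made up for illustration.

```r
library(dplyr)
library(reshape2)

# Rebuild the grouped z-scores so the example is self-contained.
test <- melt(iris, id.vars = "Species") %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value))) %>%
  ungroup()

# Keep only Petal.Width rows whose within-group z-score is positive.
positive_petal_width <- test %>%
  filter(variable == "Petal.Width", z_score > 0)

nrow(positive_petal_width)
```

Multiple conditions passed to filter() are combined with a logical AND, which mirrors the `&` in the which() call.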
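Your code is giving you z-scores by group. It seems to me these z-scores should be comparable precisely because you have scaled each group individually to mean = 0 and sd = 1, rather than scaling each value by the mean and sd of the whole data frame. For example: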
library(tidyverse)
First, set up the melted data frame:
dat = iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>% # You can also use scale(value) as pointed out by @RuiBarradas
  ungroup %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Now look at the first three rows and compare them with a direct calculation:
head(dat, 3)
# Species variable value z_score_group z_score_ungrouped
# 1 setosa Sepal.Length 5.1 0.2666745 0.8278959
# 2 setosa Sepal.Length 4.9 -0.3007180 0.7266552
# 3 setosa Sepal.Length 4.7 -0.8681105 0.6254145
# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))
# [1] 0.2666745 -0.3007180 -0.8681105
# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))
# [1] 0.8278959 0.7266552 0.6254145
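The three-row comparison above can be extended to every group with a summary table. The sketch below is an illustration rather than part of the original answer: it uses tidyr's pivot_longer() in place of gather() (the row order differs, but the per-group summaries do not), and `group_summary` is a name introduced here.

```r
library(dplyr)
library(tidyr)

# Long-format iris with both grouped and ungrouped z-scores.
dat <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))

# Per-group mean and sd of each kind of z-score.
group_summary <- dat %>%
  group_by(Species, variable) %>%
  summarise(
    mean_group     = mean(z_score_group),
    sd_group       = sd(z_score_group),
    mean_ungrouped = mean(z_score_ungrouped),
    sd_ungrouped   = sd(z_score_ungrouped),
    .groups = "drop"
  )

print(group_summary)
```

If the grouping was applied, mean_group should be zero and sd_group one (up to floating-point error) in all 12 groups, while mean_ungrouped and sd_ungrouped vary from group to group.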
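Now visualize the z-scores. The first graph below shows the raw data. The second shows the ungrouped z-scores, where we have simply rescaled the data to an overall mean = 0 and SD = 1. The third graph is what your code produces: each group has been scaled individually to mean = 0 and SD = 1.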
gridExtra::grid.arrange(
  grobs=setNames(names(dat)[c(3,5,4)], names(dat)[c(3,5,4)]) %>%
    map(~ ggplot(dat %>% mutate(group=paste(Species,variable,sep="_")),
                 aes_string(.x, colour="group")) + geom_density()),
  ncol=1)
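Recent ggplot2 releases mark aes_string() as deprecated, so here is one possible way to draw the same three panels without it. This is a sketch under assumptions, not the original author's method: it stacks the three columns with pivot_longer() and uses facet_wrap() instead of gridExtra, and the names `plot_data`, `scale_type` and `x` are introduced only for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Rebuild dat so the example is self-contained.
dat <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))

# Stack the three columns to plot, keeping the panel order used above:
# raw values, ungrouped z-scores, grouped z-scores.
panel_order <- c("value", "z_score_ungrouped", "z_score_group")
plot_data <- dat %>%
  mutate(group = paste(Species, variable, sep = "_")) %>%
  pivot_longer(all_of(panel_order), names_to = "scale_type", values_to = "x") %>%
  mutate(scale_type = factor(scale_type, levels = panel_order))

# One density curve per Species/variable group, one panel per scale type.
p <- ggplot(plot_data, aes(x = x, colour = group)) +
  geom_density() +
  facet_wrap(~ scale_type, ncol = 1, scales = "free")

print(p)
```

Free scales are used because the raw values and the z-scores live on different ranges; with a shared x-axis the panels would be harder to compare.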