How do I calculate grouped z-scores in R using dplyr?

Question

Using the iris dataset, I'm trying to calculate a z-score for each variable. I got the data into tidy format by doing the following:

library(reshape2)
library(dplyr)
test <- iris
test <- melt(iris,id.vars = 'Species')

This gives me the following:

  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4

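However, when I try to create a z-score column for each group (for example, the z-scores for Sepal.Length should not be comparable with those for Sepal.Width) using the following: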

test <- test %>% 
  group_by(Species, variable) %>% 
  mutate(z_score = (value - mean(value)) / sd(value))

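The resulting z-scores have not been grouped and are based on all of the data.

What is the best way to return z-scores by group using dplyr?

Many thanks!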

Answer 1

Rui*_*das 6

I believe you are overcomplicating things by computing the z-scores with mean/sd. Just use the function scale.

test <- test %>% 
  group_by(Species, variable) %>% 
  mutate(z_score = scale(value))

test
## A tibble: 600 x 4
## Groups:   Species, variable [12]
#   Species     variable value     z_score
#    <fctr>       <fctr> <dbl>       <dbl>
# 1  setosa Sepal.Length   5.1  0.26667447
# 2  setosa Sepal.Length   4.9 -0.30071802
# 3  setosa Sepal.Length   4.7 -0.86811050
# 4  setosa Sepal.Length   4.6 -1.15180675
# 5  setosa Sepal.Length   5.0 -0.01702177
# 6  setosa Sepal.Length   5.4  1.11776320
# 7  setosa Sepal.Length   4.6 -1.15180675
# 8  setosa Sepal.Length   5.0 -0.01702177
# 9  setosa Sepal.Length   4.4 -1.71919923
#10  setosa Sepal.Length   4.9 -0.30071802
## ... with 590 more rows
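
One caveat worth noting: base R's scale() returns a one-column matrix, so depending on the dplyr version the new column may be stored as a matrix rather than a plain numeric vector. A minimal sketch of one way to guard against that, assuming the same melted iris data as above (the as.numeric() wrapper is my addition, not part of the original answer):

```r
library(dplyr)
library(reshape2)

# Same long-format data as in the question
test <- melt(iris, id.vars = "Species")

test <- test %>%
  group_by(Species, variable) %>%
  # scale() returns a one-column matrix; as.numeric() drops the
  # matrix attributes so z_score is an ordinary numeric column
  mutate(z_score = as.numeric(scale(value))) %>%
  ungroup()

# TRUE if the column is a plain numeric vector
is.numeric(test$z_score) && !is.matrix(test$z_score)
```

The ungroup() at the end is optional; it just keeps later operations from silently running per group.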

Edit.
Following a comment by the OP, I am posting some code to get the rows where Petal.Width has a positive z_score.

i1 <- which(test$variable == "Petal.Width" & test$z_score > 0)
test[i1, ]
## A tibble: 61 x 4
## Groups:   Species, variable [3]
#   Species    variable value  z_score
#    <fctr>      <fctr> <dbl>    <dbl>
# 1  setosa Petal.Width   0.4 1.461300
# 2  setosa Petal.Width   0.3 0.512404
# 3  setosa Petal.Width   0.4 1.461300
# 4  setosa Petal.Width   0.4 1.461300
# 5  setosa Petal.Width   0.3 0.512404
# 6  setosa Petal.Width   0.3 0.512404
# 7  setosa Petal.Width   0.3 0.512404
# 8  setosa Petal.Width   0.4 1.461300
# 9  setosa Petal.Width   0.5 2.410197
#10  setosa Petal.Width   0.4 1.461300
## ... with 51 more rows

Answer 2

eip*_*i10 5

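Your code is giving you z-scores by group. It seems to me that these z-scores should be directly comparable, precisely because you have scaled each group individually to mean = 0 and sd = 1, rather than scaling each value based on the mean and sd of the full data frame. For example: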

library(tidyverse)

First, set up the melted data frame:

dat = iris %>% 
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>% 
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%   # You can also use scale(value) as pointed out by @RuiBarradas
  ungroup %>% 
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))

Now look at the first three rows and compare them with a direct calculation:

head(dat, 3)

#   Species     variable value z_score_group z_score_ungrouped
# 1  setosa Sepal.Length   5.1     0.2666745         0.8278959
# 2  setosa Sepal.Length   4.9    -0.3007180         0.7266552
# 3  setosa Sepal.Length   4.7    -0.8681105         0.6254145

# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))

# [1]  0.2666745 -0.3007180 -0.8681105

# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))

# [1] 0.8278959 0.7266552 0.6254145
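
Rather than spot-checking three rows, the same comparison can be run across every group at once. This is only a sketch, assuming the dat data frame built above (the group_check name and its summary columns are my own, for illustration): if the grouping took effect, each Species/variable combination should show a mean of about 0 and an sd of about 1 for z_score_group, while the ungrouped column will not.

```r
library(dplyr)
library(tidyr)

# Rebuild the data frame from the answer so the block runs on its own
dat <- iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))

# Per-group mean and sd of both z-score columns
group_check <- dat %>%
  group_by(Species, variable) %>%
  summarise(
    mean_group     = mean(z_score_group),
    sd_group       = sd(z_score_group),
    mean_ungrouped = mean(z_score_ungrouped),
    sd_ungrouped   = sd(z_score_ungrouped),
    .groups = "drop"
  )

group_check
```

If mean_group and sd_group come out far from 0 and 1, the mutate() call was not applied per group, which would point to a problem outside the code shown in the question.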

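Now visualize the z-scores. The first graph below is the raw data. The second is the ungrouped z-scores: the data have simply been rescaled to an overall mean = 0 and SD = 1. The third graph is what your code produces: each group has been individually scaled to mean = 0 and SD = 1.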

gridExtra::grid.arrange(
  grobs=setNames(names(dat)[c(3,5,4)], names(dat)[c(3,5,4)]) %>% 
    map(~ ggplot(dat %>% mutate(group=paste(Species,variable,sep="_")), 
                 aes_string(.x, colour="group")) + geom_density()),
  ncol=1)
