How to run a statistical test by group with dplyr, then build a tibble with broom

sca*_*der 4 r broom tidyverse

I have the following data frame:

library(tidyverse)

dat <- structure(list(charge.Group3 = c(0.167, 0.167, 0.1, 0.067, 0.033, 
0.033, 0.067, 0.133, 0.2, 0.067, 0.133, 0.114, 0.167, 0.033, 
0.1, 0.033, 0.133, 0.267, 0.133, 0.233, 0.1, 0.167, 0.067, 0.133, 
0.1, 0.133, 0.1, 0.133, 0.1, 0.067, 0.167, 0), hydrophobicity.Group3 = c(0.267, 
0.467, 0.067, 0.167, 0.267, 0.1, 0.367, 0.233, 0.367, 0.233, 
0.133, 0.205, 0.333, 0.267, 0.267, 0.067, 0.133, 0.3, 0.233, 
0.267, 0.5, 0.333, 0.2, 0.5, 0.5, 0.4, 0.033, 0.3, 0.233, 0.5, 
0.233, 0.033), class = c("Negative", "Negative", "Positive", 
"Positive", "Positive", "Positive", "Positive", "Negative", "Positive", 
"Positive", "Positive", "Positive", "Positive", "Positive", "Negative", 
"Positive", "Negative", "Negative", "Negative", "Negative", "Negative", 
"Negative", "Negative", "Negative", "Negative", "Negative", "Positive", 
"Positive", "Positive", "Negative", "Positive", "Negative")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -32L))

dat
#> # A tibble: 32 x 3
#>    charge.Group3 hydrophobicity.Group3 class   
#>            <dbl>                 <dbl> <chr>   
#>  1         0.167                 0.267 Negative
#>  2         0.167                 0.467 Negative
#>  3         0.1                   0.067 Positive
#>  4         0.067                 0.167 Positive
#>  5         0.033                 0.267 Positive
#>  6         0.033                 0.1   Positive
#>  7         0.067                 0.367 Positive
#>  8         0.133                 0.233 Negative
#>  9         0.2                   0.367 Positive
#> 10         0.067                 0.233 Positive
#> # ... with 22 more rows

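What I want to do: for each feature (charge.Group3 and hydrophobicity.Group3), run wilcox.test between the Negative and Positive classes, and end up with the p-values as a data frame or tibble: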

features                      pvalue
charge.Group3                 0.1088  
hydrophobicity.Group3         0.03895
# I computed these by hand.

Note that in practice there are more than 2 features. How can I do this?

Ant*_*osK 5

If all you need is the p-value of each test, you don't actually need broom.

library(tidyverse)


dat %>% 
  gather(group, value, -class) %>%    # reshape data            
  nest(-group) %>%                    # for each group nest data
  mutate(pval = map_dbl(data, ~wilcox.test(value ~ class, data = .)$p.value)) %>%  # get p value for wilcoxon test
  select(-data)                       # remove data column


# # A tibble: 2 x 2
#   group                   pval
#   <chr>                  <dbl>
# 1 charge.Group3         0.109 
# 2 hydrophobicity.Group3 0.0390        

Reshaping first lets you apply this procedure no matter how many columns you have, assuming class is the only extra variable.
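To make the reshaping step concrete, here is a minimal sketch of the same idea, assuming tidyr 1.0.0 or later, where pivot_longer() is available as the newer counterpart of gather(). The names dat_small, long and pvals are illustrative only; dat_small is just the first 8 rows of the question's data so that the snippet runs on its own. wilcox.test may warn that exact p-values cannot be computed when there are ties.

```r
library(tidyverse)

# First 8 rows of the question's data, copied so the sketch is self-contained
dat_small <- tibble(
  charge.Group3 = c(0.167, 0.167, 0.1, 0.067, 0.033, 0.033, 0.067, 0.133),
  hydrophobicity.Group3 = c(0.267, 0.467, 0.067, 0.167, 0.267, 0.1, 0.367, 0.233),
  class = c("Negative", "Negative", "Positive", "Positive",
            "Positive", "Positive", "Positive", "Negative")
)

# Long format: one row per (observation, feature) pair.
# Every column except class becomes a value of `group`.
long <- dat_small %>%
  pivot_longer(-class, names_to = "group", values_to = "value")

# One nested data frame per feature, then one test per nested data frame
pvals <- long %>%
  nest(data = c(class, value)) %>%
  mutate(pval = map_dbl(data, ~ wilcox.test(value ~ class, data = .x)$p.value)) %>%
  select(-data)

pvals
```

Because the feature names end up in the group column, adding more feature columns to the input only adds rows to the result; the pipeline itself does not change.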

Or you can even avoid map, as @Moody_Mudskipper suggested:

dat %>% 
  gather(group, value, -class) %>% 
  group_by(group) %>% 
  summarize(results = wilcox.test(value ~ class)$p.value)

If you really do want to involve broom, you can do:

library(broom)

dat %>% 
   gather(group, value, -class) %>%  
   nest(-group) %>%                  
   mutate(results = map(data, ~tidy(wilcox.test(value ~ class, data = .)))) %>%
   select(-data) %>%
   unnest(results)

# # A tibble: 2 x 5
#   group                 statistic p.value method                                            alternative
#   <chr>                     <dbl>   <dbl> <chr>                                             <chr>      
# 1 charge.Group3              170.  0.109  Wilcoxon rank sum test with continuity correction two.sided  
# 2 hydrophobicity.Group3      183   0.0390 Wilcoxon rank sum test with continuity correction two.sided 

This returns more columns, but you can keep just the p-value if that is all you need.
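As a rough sketch of that last step, the tidied result can be trimmed with select(). This assumes tidyr 1.0.0 or later for pivot_longer() and the named form of nest(); dat_small, tidy_res and pvals_only are illustrative names, and dat_small is again only the first 8 rows of the question's data.

```r
library(tidyverse)
library(broom)

# First 8 rows of the question's data, copied so the sketch is self-contained
dat_small <- tibble(
  charge.Group3 = c(0.167, 0.167, 0.1, 0.067, 0.033, 0.033, 0.067, 0.133),
  hydrophobicity.Group3 = c(0.267, 0.467, 0.067, 0.167, 0.267, 0.1, 0.367, 0.233),
  class = c("Negative", "Negative", "Positive", "Positive",
            "Positive", "Positive", "Positive", "Negative")
)

# Full tidy() output: one row per feature, several columns per test
tidy_res <- dat_small %>%
  pivot_longer(-class, names_to = "group", values_to = "value") %>%
  nest(data = c(class, value)) %>%
  mutate(results = map(data, ~ tidy(wilcox.test(value ~ class, data = .x)))) %>%
  select(-data) %>%
  unnest(results)

# Keep only the feature name and its p-value
pvals_only <- tidy_res %>%
  select(group, p.value)

pvals_only
```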