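I am trying to compute several statistics for a data frame.
I tried dplyr's summarise_each. However, the result comes back as a flat single row, with the function name appended as a suffix.
Is there a direct way, using dplyr or base R, to get the result as a data frame whose columns are the columns of the original data frame and whose rows are the summary functions?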
library(dplyr)
df = data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
df %>% summarise_each(funs(min, max))
#   A_min B_min C_min A_max B_max C_max
# 1    13   117     0    98   188     1
# Desired format
summary(df)
#        A               B               C
#  Min.   :13.00   Min.   :117.0   Min.   :0.00
#  1st Qu.:34.75   1st Qu.:134.2   1st Qu.:0.00
#  Median :45.00   Median :148.0   Median :1.00
#  Mean   :52.35   Mean   :149.9   Mean   :0.65
#  3rd Qu.:62.50   3rd Qu.:168.8   3rd Qu.:1.00
#  Max.   :98.00   Max.   :188.0   Max.   :1.00
How about:
library(tidyr)
gather(df) %>% group_by(key) %>% summarise_all(funs(min, max))
# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1
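For readers on more recent versions of the tidyverse: funs() has since been deprecated in dplyr and gather() has been superseded by pivot_longer() in tidyr. A minimal sketch of the same idea with the newer functions might look like the following. The set.seed() call and the column names key and value are my own choices for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0, 1), 20, replace = TRUE))

# Stack every column into key/value pairs, then summarise per original column
res <- df %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  summarise(min = min(value), max = max(value))

res
```

This still gives one row per variable and one column per function, the transpose of the layout asked for in the question.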
Why not simply use sapply with summary?
sapply(df, summary)
which gives:
            A     B    C
Min.     1.00 112.0 0.00
1st Qu. 23.75 134.5 0.00
Median  57.00 148.5 1.00
Mean    50.15 149.9 0.55
3rd Qu. 77.50 167.2 1.00
Max.    94.00 191.0 1.00
To get a data frame, just wrap the sapply call in data.frame(): data.frame(sapply(df, summary)). If you want to keep the summary statistics in a column, you can extract them from the row names with df$rn <- rownames(df), or use the keep.rownames parameter of data.table:
library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)
This gives:
> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0
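If you only need a few specific functions rather than everything summary() returns, the same sapply pattern works with a small anonymous function. The sketch below is base R only. It assumes min and max are the statistics of interest (as in the question), and the names stats, out and rn are just illustrative.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0, 1), 20, replace = TRUE))

# sapply returns a matrix: one row per statistic, one column per variable
stats <- sapply(df, function(x) c(min = min(x), max = max(x)))

# Convert to a data frame and keep the statistic names as a regular column
out <- data.frame(stats)
out$rn <- rownames(out)
rownames(out) <- NULL

out
```

The named vector returned by the anonymous function is what supplies the row names, so adding another statistic is just a matter of adding another named element.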
This is not the only way, but you can reshape the data.frame as you wish using dplyr and tidyr (plus stringr, or anything else that trims whitespace).
library(dplyr)
df = data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
as_data_frame(summary(df)) %>%
  # trim stray whitespace around the variable names
mutate(Var2 = stringr::str_trim(Var2)) %>%
  # Var1 carries no information, so drop it
select(-Var1) %>%
# Get the type of summary and the value
tidyr::separate(n, c("Type", "value"), sep = ":") %>%
# Convert value to numeric
mutate(value = as.numeric(value)) %>%
# reshape as you wish
tidyr::spread(Var2, value, drop = T)
#> # A tibble: 6 x 4
#> Type A B C
#> * <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2 1.00
#> 2 3rd Qu. 77.25 164.5 1.00
#> 3 Max. 95.00 193.0 1.00
#> 4 Mean 57.30 144.6 0.85
#> 5 Median 63.00 143.5 1.00
#> 6 Min. 8.00 111.0 0.00