I have a dataset like this:
ID color1 color2 color3 shape1 shape2 size
55 red blue NA circle triangle small
67 yellow NA NA triangle NA medium
83 blue yellow NA circle NA large
78 red yellow blue square circle large
43 green NA NA square circle small
29 yellow green NA circle triangle medium
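For anyone who wants to reproduce this, the table above can be read into R roughly as follows. This is a minimal sketch in base R; the object name `df` is an assumption, chosen to match the name used in the answer's code.

```r
# Rebuild the example data from the question.
# `df` is an assumed name; the literal string "NA" is read as a missing value.
df <- read.table(
  text = "ID color1 color2 color3 shape1 shape2 size
55 red blue NA circle triangle small
67 yellow NA NA triangle NA medium
83 blue yellow NA circle NA large
78 red yellow blue square circle large
43 green NA NA square circle small
29 yellow green NA circle triangle medium",
  header = TRUE,
  stringsAsFactors = FALSE  # keep the text columns as character, not factor
)

# Inspect the column types and the first values
str(df)
```

Keeping the text columns as character (rather than factor) lets the color and shape columns be stacked into a single column later without level mismatches.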
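I want to create a data frame that contains the frequency and percentage of each level of every variable, but I'm having trouble because in some cases the same variable is spread across multiple columns.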
Variable  Level     Freq  Percent
color     blue         3    27.27
          red          2    18.18
          yellow       4    36.36
          green        2    18.18
          total       11   100.00
shape     circle       5    50.0
          triangle     3    30.0
          square       2    20.0
          total       10   100.0
size      small        2    33.3
          medium       2    33.3
          large        2    33.3
          total        6   100.0
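I believe I need to convert these variables to long format and then use summarize/mutate to get the frequencies, but I can't seem to figure it out. Any help is greatly appreciated.
You can use the tidyverse package to reshape the data into long format and then summarise the statistics you need.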
library(tidyverse)

df |>
  # Reshape every column except ID into long format; the pattern keeps only
  # the letters of each name, so color1, color2 and color3 all become "color"
  pivot_longer(cols = -ID,
               names_pattern = "([A-Za-z]+)",
               names_to = "variable") |>
  # Drop NA entries
  drop_na(value) |>
  # Group by variable
  group_by(variable) |>
  # Count each level within its variable
  count(value) |>
  # Calculate percentage as n / sum of n by variable
  mutate(perc = 100 * n / sum(n))
# A tibble: 10 x 4
# Groups:   variable [3]
#   variable value        n  perc
#   <chr>    <chr>    <int> <dbl>
# 1 color    blue         3  27.3
# 2 color    green        2  18.2
# 3 color    red          2  18.2
# 4 color    yellow       4  36.4
# 5 shape    circle       5  50
# 6 shape    square       2  20
# 7 shape    triangle     3  30
# 8 size     large        2  33.3
# 9 size     medium       2  33.3
#10 size     small        2  33.3
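The result above has no "total" row per variable, which the desired output in the question includes. One possible way to add them is sketched below, under a few assumptions: the data is rebuilt inline as `df`, and the names `freq`, `totals`, and `result` are illustrative. It computes the per-variable sums separately and appends them with `bind_rows()`.

```r
library(dplyr)
library(tidyr)

# Example data from the question (the name `df` is an assumption)
df <- read.table(
  text = "ID color1 color2 color3 shape1 shape2 size
55 red blue NA circle triangle small
67 yellow NA NA triangle NA medium
83 blue yellow NA circle NA large
78 red yellow blue square circle large
43 green NA NA square circle small
29 yellow green NA circle triangle medium",
  header = TRUE,
  stringsAsFactors = FALSE
)

# Frequencies and percentages per level, as in the answer
freq <- df |>
  pivot_longer(cols = -ID,
               names_pattern = "([A-Za-z]+)",
               names_to = "variable") |>
  drop_na(value) |>
  count(variable, value) |>
  group_by(variable) |>
  mutate(perc = 100 * n / sum(n)) |>
  ungroup()

# One "total" row per variable: sum of the counts and of the percentages
totals <- freq |>
  group_by(variable) |>
  summarise(value = "total", n = sum(n), perc = sum(perc))

# Append the totals and sort so each total follows its variable's levels
result <- bind_rows(freq, totals) |>
  arrange(variable, value == "total", value)

print(result)
```

Sorting on `value == "total"` places `FALSE` before `TRUE`, so each total row lands after the levels of its variable. This relies on no real level being named "total".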