Rob*_*sey 1 refactoring r dynamic tidyr dummy-variable
I have user-level data that looks like this:
ID V1 V2 V3 V4
001 1 0 1 0
002 0 1 0 1
003 0 0 0 0
004 1 1 1 0
In the example above, I would like an elegant solution (possibly using tidyr) that dynamically restructures it so that it looks like this:
ID Num_Vars Var1 Var2 Var3
001 2 V1 V3 NA
002 2 V2 V4 NA
003 0 NA NA NA
004 3 V1 V2 V3
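Note that this example is simplified; the real data has many more variables. The key requirement is code that detects how many VarX columns to create, based on the largest number of 1s that any single user has across the V columns.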
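The column count described above can be checked directly in base R before any reshaping. This is only a minimal sketch: it assumes the indicator columns are every column except `ID`, and the names `flags`, `num_vars` and `max_vars` are made up for illustration.

```r
# Sample data from the question (IDs stored as integers)
df <- data.frame(
  ID = 1:4,
  V1 = c(1L, 0L, 0L, 1L),
  V2 = c(0L, 1L, 0L, 1L),
  V3 = c(1L, 0L, 0L, 1L),
  V4 = c(0L, 1L, 0L, 0L)
)

# Logical matrix: TRUE where an indicator is set (assumes all non-ID columns are indicators)
flags <- df[, -1] == 1

# Number of set indicators per user; this is the Num_Vars column
num_vars <- rowSums(flags)

# The largest per-user count is the number of VarX columns needed
max_vars <- max(num_vars)
```

For the sample data, `max_vars` is 3, which matches the Var1 to Var3 columns in the desired output.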
This feels like fairly standard reshaping: convert to long, operate by group, convert back to wide:
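library(dplyr)
library(tidyr)

df %>%
  # wide to long: one row per ID/variable pair
  gather(key = var, value = value, -ID) %>%
  group_by(ID) %>%
  # keep only the variables that are set
  filter(value != 0) %>%
  # count the set variables per ID and label them Var1, Var2, ...
  mutate(Num_Vars = n(),
         Var_Label = paste0("Var", 1:n())) %>%
  # long back to wide: one column per label
  spread(key = Var_Label, value = var) %>%
  select(-value) %>%
  # restore IDs with no set variables; their Num_Vars is NA, not 0
  full_join(distinct(df, ID), by = "ID")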
# Source: local data frame [4 x 5]
# Groups: ID [?]
#
# ID Num_Vars Var1 Var2 Var3
# <int> <int> <chr> <chr> <chr>
# 1 1 2 V1 V3 <NA>
# 2 2 2 V2 V4 <NA>
# 3 4 3 V1 V2 V3
# 4 3 NA <NA> <NA> <NA>
Using this data, shared reproducibly with dput():
df = structure(list(ID = 1:4, V1 = c(1L, 0L, 0L, 1L), V2 = c(0L, 1L,
0L, 1L), V3 = c(1L, 0L, 0L, 1L), V4 = c(0L, 1L, 0L, 0L)), .Names = c("ID",
"V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA,
-4L))
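The same long, group, wide reshaping can also be sketched with `pivot_longer()` and `pivot_wider()`, which superseded `gather()` and `spread()` in tidyr 1.0.0. This is a hedged variant rather than the code above: it assumes tidyr >= 1.0.0, the name `result` is made up for illustration, and it adds a `coalesce()` step so that users with no set variables get `Num_Vars = 0` as the question asks, instead of `NA`.

```r
library(dplyr)
library(tidyr)

# Same sample data as the dput() output above
df <- data.frame(
  ID = 1:4,
  V1 = c(1L, 0L, 0L, 1L),
  V2 = c(0L, 1L, 0L, 1L),
  V3 = c(1L, 0L, 0L, 1L),
  V4 = c(0L, 1L, 0L, 0L)
)

result <- df %>%
  # wide to long: one row per ID/variable pair
  pivot_longer(-ID, names_to = "var", values_to = "value") %>%
  # keep only the variables that are set
  filter(value != 0) %>%
  group_by(ID) %>%
  # count the set variables per ID and label them Var1, Var2, ...
  mutate(Num_Vars = n(),
         Var_Label = paste0("Var", row_number())) %>%
  ungroup() %>%
  select(-value) %>%
  # long back to wide: as many VarX columns as the largest per-ID count
  pivot_wider(names_from = Var_Label, values_from = var) %>%
  # bring back IDs that had no set variables
  right_join(distinct(df, ID), by = "ID") %>%
  # those IDs have no count yet; report 0 instead of NA
  mutate(Num_Vars = coalesce(Num_Vars, 0L)) %>%
  arrange(ID)
```

The number of `Var` columns is not hard-coded: `pivot_wider()` creates one column per distinct label, so it follows the largest count found in the data.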