I have a data frame like this:
df <- data.frame(s1=c("a","a/b","b","a","a/b"),s2=c("ab/bb","bb","ab","ab","bb"),s3=c("Doa","Doa","Dob/Doa","Dob/Doa","Dob"))
   s1    s2      s3
1   a ab/bb     Doa
2 a/b    bb     Doa
3   b    ab Dob/Doa
4   a    ab Dob/Doa
5 a/b    bb     Dob
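Each column can take one of two values, or both separated by "/". I want to break the columns out into a set of binary columns, one per value.
The desired data frame is: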
  a b ab bb Doa Dob
1 1 0  1  1   1   0
2 1 1  0  1   1   0
3 0 1  1  0   1   1
4 1 0  1  0   1   1
5 1 1  0  1   0   1
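I tried doing this with tidyr::separate and tapply, but it got rather complicated because I had to specify the column names for each pair, and there are many columns.
First make sure your data is character rather than factor. Then split the data frame into one single-row data.frame per row. For each of these, apply str_split on '/', set the names equal to the values, and convert the result to a list. You can then bind the results together and, at the end, set every non-NA value to 1 and every NA to 0.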
library(tidyverse) # dplyr, + stringr for str_split, + purrr for map
df %>%
  mutate_all(as.character) %>%
  split(seq(nrow(.))) %>%
  map(~ str_split(., '/') %>% unlist %>% setNames(., .) %>% as.list) %>%
  bind_rows %>%
  mutate_all(~as.numeric(!is.na(.)))
# # A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 0 0
# 2 1 0 1 1 1 0
# 3 0 1 0 1 1 1
# 4 1 1 0 1 0 1
# 5 1 0 1 0 1 1
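To make the per-row step in the pipeline above more concrete, here is a minimal base R sketch of what happens to a single row. It is only an illustration: the names `row1`, `parts`, and `named` are made up for this example, and base `strsplit` stands in for `str_split`.

```r
# One row of the example data, as a plain character vector.
row1 <- c(s1 = "a", s2 = "ab/bb", s3 = "Doa")

# Split every cell on "/" and flatten the pieces into one vector of tokens.
parts <- unlist(strsplit(row1, "/", fixed = TRUE), use.names = FALSE)

# Name each token after itself, so the tokens can become column names later.
named <- as.list(setNames(parts, parts))

str(named)
```

When one such list per row is bound together, a token that is missing from a row ends up as NA in that row, which is why the final step only has to test for `is.na`.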
Another similar option (same output):
df %>%
  mutate_all(as.character) %>%
  split(seq(nrow(.))) %>%
  map(~ str_split(., '/') %>% unlist %>% table %>% as.list) %>%
  bind_rows %>%
  mutate_all(replace_na, 0)
Or you can convert to long first and then to wide, similar to akrun's answer:
library(data.table)
setDT(df)
library(magrittr)
melt(df[, r := 1:.N], 'r') %>%
  .[, .(value = strsplit(value, '/')[[1]]), .(r, variable)] %>%
  dcast(r ~ value, fun.aggregate = length)
# r Doa Dob a ab b bb
# 1: 1 1 0 1 1 0 1
# 2: 2 1 0 1 0 1 1
# 3: 3 1 1 0 1 1 0
# 4: 4 1 1 1 1 0 0
# 5: 5 0 1 1 0 1 1
Another approach is to reshape to 'long' format with pivot_longer, then use separate_rows to split the 'value' column, and reshape back to 'wide' format:
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn) %>%
  separate_rows(value) %>%
  mutate(i1 = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = i1, values_fill = list(i1 = 0)) %>%
  select(-rn)
# A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0 0
#2 1 0 1 1 1 0
#3 0 1 0 1 1 1
#4 1 1 0 1 0 1
#5 1 0 1 0 1 1
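One thing to watch in the call above: `separate_rows(value)` is used without a `sep` argument. As far as I know, the default separator is the regular expression `"[^[:alnum:].]+"`, which happens to match "/" but would also split on other punctuation. The base R sketch below mimics that with `strsplit`; the pattern is my assumption about the default, and the value `"Do-a/Dob"` is hypothetical, not from the question.

```r
default_sep <- "[^[:alnum:].]+"   # assumed default of separate_rows()
x <- "Do-a/Dob"                   # hypothetical value containing a hyphen

# Splitting on any run of non-alphanumeric characters also breaks at "-".
by_default <- strsplit(x, default_sep)[[1]]

# Splitting on a literal "/" keeps "Do-a" together.
by_slash <- strsplit(x, "/", fixed = TRUE)[[1]]

by_default
by_slash
```

If the real data may contain such characters, passing the separator explicitly, as in `separate_rows(value, sep = "/")`, is probably the safer choice.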
Or in base R, with table and strsplit:
+(table(stack(setNames(strsplit(as.character(unlist(df)), "/",
fixed = TRUE), c(row(df))))[2:1]) > 0)
# values
#ind a ab b bb Doa Dob
# 1 1 1 0 1 1 0
# 2 1 0 1 1 1 0
# 3 0 1 1 0 1 1
# 4 1 1 0 0 1 1
# 5 1 0 1 1 0 1
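Since the one-liner above is fairly dense, here is the same idea unpacked into separate steps. This is a sketch for reading purposes only; the intermediate names (`cells`, `pieces`, `long`, `counts`, `res`) are introduced here and are not part of the original answer.

```r
df <- data.frame(s1 = c("a", "a/b", "b", "a", "a/b"),
                 s2 = c("ab/bb", "bb", "ab", "ab", "bb"),
                 s3 = c("Doa", "Doa", "Dob/Doa", "Dob/Doa", "Dob"),
                 stringsAsFactors = FALSE)

# 1. Flatten the cells column by column and split each one on "/".
cells  <- as.character(unlist(df))
pieces <- strsplit(cells, "/", fixed = TRUE)

# 2. Label each cell's pieces with the row the cell came from.
#    c(row(df)) is also column-major, so it lines up with `cells`.
names(pieces) <- c(row(df))

# 3. stack() gives a long data frame: `values` (token) and `ind` (row id).
long <- stack(pieces)

# 4. Cross-tabulate row id against token, then turn the counts into 0/1.
counts <- table(long[2:1])
res <- +(counts > 0)

res
```

The `> 0` comparison matters when the same token can occur in more than one column of a row, because `table` would then report a count above 1. The column order follows the sort order of the tokens, which can differ between locales. If a plain data frame is needed, `as.data.frame.matrix(res)` should convert the table.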