Conditionally replace values in a column using dplyr

use*_*934 9 r dplyr

I have a sample dataset with a column that reads as follows:

Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee

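What I'd like to do is collapse it into two factor levels, "Candy" and "Non-Candy". I can do this in Python/Pandas, but I can't seem to figure out a dplyr-based solution. Thanks!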

lee*_*sej 26

Using dplyr, with base R's replace():

dat %>% 
    mutate(var = replace(var, var != "Candy", "Not Candy"))

This is significantly faster than the ifelse approaches. The code to create the initial data frame is as follows:

library(dplyr)
# tibble() (re-exported by dplyr) names the column directly;
# as_data_frame() is deprecated in current versions of tibble
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
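Since the question asks for a factor, the two snippets above can be combined into one end-to-end sketch. It assumes `var` is a character column and uses the question's "Non-Candy" label. The name `out` is only illustrative.

```r
library(dplyr)

# Sample data from the question, stored as a character column
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# replace() overwrites every non-"Candy" value;
# factor() then converts the result and fixes the level order
out <- dat %>%
  mutate(var = factor(replace(var, var != "Candy", "Non-Candy"),
                      levels = c("Candy", "Non-Candy")))

table(out$var)
```

Note that this relies on `var` being character. If `var` is already a factor, assigning a label that is not an existing level through replace() produces NA values with a warning.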

  • Isn't there a function that doesn't require repeating `var`? (2 upvotes)

eip*_*i10 6

Assuming your data frame is dat and your column is var:

dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))


Mic*_*ico 6

No need for dplyr. Assuming var is already stored as a factor:

non_c <- setdiff(levels(dat$var), "Candy")

levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

See ?levels.
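As a self-contained illustration of the level-merging idea above, here is a minimal sketch that builds the question's data as a factor (an assumption, since the question does not say how the column is stored) and inspects the result:

```r
# Sample data from the question, stored as a factor column
dat <- data.frame(var = factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                                 "Candy", "Ice Cream", "Gum", "Candy", "Coffee")))

# Every existing level except "Candy" is merged into "Non-Candy"
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

levels(dat$var)
table(dat$var)
```

The list form of `levels<-` maps each new level name to the old levels it should absorb, so the renaming is specified once per level rather than once per row.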

This is much more efficient than the ifelse approach, which is bound to be slow:

library(microbenchmark)
library(dplyr)  # needed for the %>% / mutate() comparison below
set.seed(01239)
smp <- data.frame(sample(dat$var, 1e6, TRUE))
names(smp) <- "var"

times <- 
  replicate(50, 
            {cop <- smp
            s <- get_nanotime()
            levs <- setdiff(levels(cop$var), "Candy")
            levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
            d1 <- get_nanotime() - s
            cop <- smp
            s <- get_nanotime()
            cop = cop %>%
              mutate(candy.flag = factor(ifelse(var == "Candy", 
                                                "Candy", "Non-Candy")))
            d2 <- get_nanotime() - s
            cop <- smp
            s <- get_nanotime()
            cop$var <- 
              factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
            d3 <- get_nanotime() - s
            c(levels = d1, dplyr = d2, direct = d3)})

# ratio of the dplyr and direct medians to the levels median
(x <- apply(times, 1, median))[2:3]/x[1]
#    dplyr   direct 
# 8.894303 4.962791 

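That is, resetting the levels is about 9 times faster than the dplyr/ifelse version (and about 5 times faster than the direct factor() call).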
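The hand-rolled timing loop above could also be expressed with microbenchmark() itself. The following is only a sketch under some assumptions: it uses a smaller sample (1e5 rows) to keep the run short, it calls ifelse directly rather than through mutate(), and `relevel_list` is a hypothetical helper name introduced here. Its timings will therefore not match the figures quoted above.

```r
library(microbenchmark)

set.seed(1239)
vals <- c("Candy", "Sanitizer", "Water", "Cake", "Ice Cream", "Gum", "Coffee")
smp <- data.frame(var = factor(sample(vals, 1e5, TRUE)))

# Hypothetical helper: merge all non-"Candy" levels into "Non-Candy"
relevel_list <- function(f) {
  levels(f) <- list(Candy = "Candy", "Non-Candy" = setdiff(levels(f), "Candy"))
  f
}

# Time the three approaches on the same input, 20 runs each
res <- microbenchmark(
  levels = relevel_list(smp$var),
  ifelse = factor(ifelse(smp$var == "Candy", "Candy", "Non-Candy")),
  direct = factor(smp$var == "Candy", labels = c("Non-Candy", "Candy")),
  times = 20
)

print(res)
```

Printing `res` gives summary statistics per expression, so the medians can be compared directly instead of being computed by hand.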

  • Or `factor(dat$var == "Candy", labels = c("Non-Candy", "Candy"))`, but I think resetting the levels is a nice approach. (2 upvotes)

PhJ*_*PhJ 6

Another solution with dplyr, using case_when:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           TRUE ~ 'Non-Candy'))

The syntax of case_when is condition ~ value to replace. The documentation is available via ?case_when.

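This is probably less efficient than the solution using replace, but it has the advantage that several replacements can be done in a single command while staying readable, for example to produce three levels: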

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           var == 'Water' ~ 'Water',
                           TRUE ~ 'Neither-Water-Nor-Candy'))
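To see the three-level version above on the question's data, here is a minimal runnable sketch. It assumes `var` is a character column, and the name `out` is only illustrative. case_when checks its conditions in order and uses the first one that matches, which is why the final `TRUE` acts as a catch-all.

```r
library(dplyr)

# Sample data from the question, stored as a character column
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# Conditions are tried top to bottom; TRUE catches everything left over
out <- dat %>%
  mutate(var = case_when(var == "Candy" ~ "Candy",
                         var == "Water" ~ "Water",
                         TRUE ~ "Neither-Water-Nor-Candy"))

# Tally the rows in each resulting group
count(out, var)
```

The result is still a character column. Wrapping the case_when call in factor() would turn it into a factor if that is what is needed.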