I have a sample dataset with a column that reads as follows:
Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee
What I'd like to do is recode this into two levels: "Candy" and "Non-Candy". I can do this with Python/Pandas, but I can't seem to figure out a dplyr-based solution. Thanks!
lee*_*sej 26
With dplyr and tidyr:
dat %>%
  mutate(var = replace(var, var != "Candy", "Not Candy"))
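This is noticeably faster than the ifelse approach. The code to create the initial data frame is:
library(dplyr)
# tibble() replaces the deprecated as_data_frame() and names the column directly
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))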
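One caveat with replace: it works as shown when var is a character column. If var is stored as a factor, assigning a value that is not an existing level yields NA with a warning. A minimal sketch of one way around that, assuming a small hypothetical data frame dat_f (not part of the original answer), is to convert to character first and back to a factor afterwards:

```r
library(dplyr)

# Hypothetical example data with var stored as a factor
dat_f <- tibble(var = factor(c("Candy", "Water", "Candy", "Gum")))

out <- dat_f %>%
  mutate(var = as.character(var),                          # drop factor levels
         var = replace(var, var != "Candy", "Not Candy"),  # recode everything else
         var = factor(var))                                # back to a two-level factor

levels(out$var)
```

The three steps run in order inside a single mutate call, so the result has exactly two levels.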
Assuming your data frame is dat and the column is var:
dat <- dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
There's no need for dplyr. Assuming var is already stored as a factor:
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)
See ?levels.
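For anyone who wants to try the levels assignment in isolation, here is a small self-contained sketch in base R. The standalone vector var is a stand-in for dat$var and is an assumption for illustration only:

```r
# Stand-in for dat$var, stored as a factor (illustrative only)
var <- factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# Every level other than "Candy" is merged into "Non-Candy"
non_c <- setdiff(levels(var), "Candy")
levels(var) <- list(Candy = "Candy", "Non-Candy" = non_c)

levels(var)
table(var)
```

Because only the levels attribute is rewritten, the underlying data is not scanned element by element for string comparison, which is likely why this approach scales well.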
The levels approach is much more efficient than the ifelse approach, which is bound to be slow:
library(dplyr)
library(microbenchmark)
set.seed(01239)
# Resample to 1e6 rows; stored as a factor so the levels() approach applies
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))
times <-
  replicate(50, {
    # 1. Reassign the factor levels directly
    cop <- smp
    s <- get_nanotime()
    levs <- setdiff(levels(cop$var), "Candy")
    levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
    d1 <- get_nanotime() - s
    # 2. dplyr with ifelse
    cop <- smp
    s <- get_nanotime()
    cop <- cop %>%
      mutate(candy.flag = factor(ifelse(var == "Candy",
                                        "Candy", "Non-Candy")))
    d2 <- get_nanotime() - s
    # 3. Build the factor from the logical comparison
    cop <- smp
    s <- get_nanotime()
    cop$var <-
      factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
    d3 <- get_nanotime() - s
    c(levels = d1, dplyr = d2, direct = d3)
  })
# Median time of each alternative relative to the levels approach
(x <- apply(times, 1, median))[2:3]/x[1]
#    dplyr   direct
# 8.894303 4.962791
That is, the levels approach is about 9 times faster than the dplyr/ifelse version.
Another dplyr solution uses case_when:
dat %>%
  mutate(var = case_when(var == 'Candy' ~ 'Candy',
                         TRUE ~ 'Non-Candy'))
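The syntax of case_when is condition ~ replacement value. See ?case_when for the documentation.
This is probably less efficient than the solution using replace, but it has the advantage that several replacements can be performed in a single command while remaining readable. For example, recoding to produce three levels:
dat %>%
  mutate(var = case_when(var == 'Candy' ~ 'Candy',
                         var == 'Water' ~ 'Water',
                         TRUE ~ 'Neither-Water-Nor-Candy'))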
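Note that case_when with string values returns a character column, whereas the question asks for a factor. A minimal sketch of one way to get a factor with a chosen level order, assuming a small hypothetical data frame (the level order shown is just an illustration), is to wrap the result in factor():

```r
library(dplyr)

# Hypothetical example data (character column)
dat <- tibble(var = c("Candy", "Sanitizer", "Water", "Cake", "Candy"))

out <- dat %>%
  mutate(var = factor(case_when(var == "Candy" ~ "Candy",
                                var == "Water" ~ "Water",
                                TRUE ~ "Neither-Water-Nor-Candy"),
                      # explicit level order instead of alphabetical
                      levels = c("Candy", "Water", "Neither-Water-Nor-Candy")))

levels(out$var)
```

Without the levels argument, factor() sorts the levels alphabetically, which may or may not be the order you want in tables and plots.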