rda*_*tor 157 if-statement r case-when dplyr mutate
Can mutate be used when the mutation is conditional (depending on the values of certain columns)?
This example helps show what I mean.
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
a b c d e f
1 1 1 6 6 1 2
2 3 3 3 2 2 3
3 4 4 6 4 4 4
4 6 2 5 5 5 2
5 3 6 3 3 6 2
6 2 7 6 7 7 7
7 5 2 5 2 6 5
8 1 6 3 6 3 2
I am hoping to find a solution to my problem using the dplyr package for creating a new column g (and yes, I know this is not code that should work, but I guess it makes the purpose clear):
library(dplyr)
df <- mutate(df,
if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
if (a == 0 | a == 1 | a == 4 | a == 3 | c == 4) {g = 3})
The code I am looking for should produce this result in this particular example:
a b c d e f g
1 1 1 6 6 1 2 3
2 3 3 3 2 2 3 3
3 4 4 6 4 4 4 3
4 6 2 5 5 5 2 NA
5 3 6 3 3 6 2 NA
6 2 7 6 7 7 7 2
7 5 2 5 2 6 5 2
8 1 6 3 6 3 2 3
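Does anyone have an idea how to do this in dplyr? This data frame is just an example; the data frames I am dealing with are much larger. I tried to use dplyr because of its speed, but perhaps there are other, better ways to handle this problem?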
G. *_*eck 194
Use ifelse:
df %>%
mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
ifelse(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA)))
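Added - if_else: Note that dplyr 0.5 introduced an if_else function, so an alternative is to replace ifelse with if_else. However, because if_else is stricter than ifelse (both legs of the condition must have the same type), the NA in this case has to be replaced with NA_real_.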
df %>%
mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
if_else(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA_real_)))
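To make the type strictness concrete, here is a minimal sketch on a made-up vector x (not from the question). It assumes dplyr is installed. It contrasts base ifelse, which silently coerces mixed types, with if_else, which raises an error when the two branches cannot share a type. As far as I know, newer dplyr releases (1.1.0 onward) relax the rule for a plain NA, so the example uses a double versus a character, which should fail in any version.

```r
library(dplyr)

x <- c(1, 5, 2)  # hypothetical input, for illustration only

# Base ifelse() silently coerces mixed branch types: the result is character.
loose <- ifelse(x == 2, 2, "other")

# if_else() refuses to mix a double with a character and raises an error.
strict_error <- tryCatch(
  {
    if_else(x == 2, 2, "other")
    FALSE
  },
  error = function(e) TRUE
)

# With matching types (double and NA_real_), if_else() works as in the answer.
strict_ok <- if_else(x == 2, 2, NA_real_)
```

The silent coercion in the first call is the kind of surprise if_else is meant to prevent, at the cost of having to spell out the typed NA.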
Added - case_when: Since this question was posted, dplyr has added case_when, so another alternative is:
df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3,
TRUE ~ NA_real_))
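case_when uses the first condition that matches, so the order of the formulas matters when a row satisfies more than one of them. The sketch below uses a small made-up data frame (the name small is hypothetical) whose first row satisfies both conditions from the question.

```r
library(dplyr)

# Hypothetical three-row data frame, for illustration only.
small <- data.frame(a = c(1, 1, 6), b = c(4, 1, 1), c = c(6, 6, 5))

out <- small %>%
  mutate(g = case_when(
    a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,  # checked first
    a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3,    # checked second
    TRUE ~ NA_real_                                     # everything else
  ))

# Row 1 matches both conditions and gets 2 (first match wins),
# row 2 matches only the second and gets 3, row 3 matches neither.
out$g
```

This mirrors the nested ifelse version, where the outer test also takes precedence over the inner one.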
Aru*_*run 52
Since you ask for other, better ways to handle the problem, here is another way using data.table:
require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
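Note that the order of the conditional statements is reversed so that g comes out correctly: the later assignment overwrites the earlier one where both conditions hold. No copy of g is made, even during the second assignment; it is replaced in place.
On larger data this should perform better than nested ifelse calls, because nested ifelse can evaluate both the 'yes' and the 'no' cases, and nesting can also get harder to read and maintain, IMHO.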
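The effect of the assignment order can be checked on a small made-up table. This is a sketch assuming data.table is installed; the names dt and wrong are hypothetical. The first row satisfies both conditions, so whichever assignment runs last decides its value.

```r
library(data.table)

# Hypothetical three-row table; row 1 satisfies both conditions.
dt <- data.table(a = c(1, 1, 6), b = c(4, 1, 1), c = c(6, 6, 5))
wrong <- copy(dt)  # independent copy, so the two experiments do not interfere

# Order from the answer: assign 3 first, then let 2 overwrite it by reference.
dt[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]
dt[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]

# Opposite order: 3 is assigned last and overwrites the 2 in row 1.
wrong[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]
wrong[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]

dt$g     # 2 3 NA
wrong$g  # 3 3 NA
```

Neither table is reassigned with `<-`: `:=` modifies the existing object, and rows that match no condition are left as NA.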
Here is a benchmark on relatively large data:
# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT)
# [1] 10000000 6
DF <- as.data.frame(DT)
DT_fun <- function(DT) {
DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
BASE_fun <- function(DF) { # R v3.1.0
transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
system.time(ans1 <- DT_fun(DT))
# user system elapsed
# 2.659 0.420 3.107
system.time(ans2 <- DPLYR_fun(DF))
# user system elapsed
# 11.822 1.075 12.976
system.time(ans3 <- BASE_fun(DF))
# user system elapsed
# 11.676 1.530 13.319
identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE
identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE
I am not sure whether this is the kind of alternative you asked for, but I hope it helps.
Mat*_*fou 34
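dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic:::derivedFactor, as you cannot access variables in the standard dplyr way and you need to declare the mode of NA, but it is considerably faster than mosaic:::derivedFactor.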
df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
a %in% c(0,1,3,4) | c == 4 ~ 3L,
TRUE~as.integer(NA)))
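EDIT: If you are using dplyr::case_when() from a version of the package before 0.7.0, you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).
Benchmark: for the benchmark, I reuse the functions from Arun's post and reduce the sample size: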
require(data.table)
require(mosaic)
require(dplyr)
require(microbenchmark)
DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)
DPLYR_case_when <- function(DF) {
DF %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
a %in% c(0,1,3,4) | c==4 ~ 3L,
TRUE~as.integer(NA)))
}
DT_fun <- function(DT) {
DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
mosa_fun <- function(DF) {
mutate(DF, g = derivedFactor(
"2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
"3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
.method = "first",
.default = NA
))
}
microbenchmark(
DT_fun(DT),
DPLYR_fun(DF),
DPLYR_case_when(DF),
mosa_fun(DF),
times=20
)
This gives:
expr min lq mean median uq max neval
DT_fun(DT) 1.503589 1.626971 2.054825 1.755860 2.292157 3.426192 20
DPLYR_fun(DF) 2.420798 2.596476 3.617092 3.484567 4.184260 6.235367 20
DPLYR_case_when(DF) 2.153481 2.252134 6.124249 2.365763 3.119575 72.344114 20
mosa_fun(DF) 396.344113 407.649356 413.743179 412.412634 416.515742 459.974969 20
Jak*_*her 13
The derivedFactor function from the mosaic package seems to be designed to handle this. Using this example, it would look like:
library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
"2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
"3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
.method = "first",
.default = NA
))
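(If you want the result to be numeric instead of a factor, convert it with as.numeric(as.character(...)). Calling as.numeric directly on a factor returns the internal level codes rather than the labels.)
derivedFactor can also be used with an arbitrary number of conditions.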
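One caveat when converting a factor to numbers is that as.numeric on a factor returns the internal level codes, not the labels. The sketch below uses a hand-built factor as a stand-in for the derivedFactor result, on the assumption that its levels are "2" and "3". It needs only base R.

```r
# Hypothetical stand-in for the factor column g (levels assumed to be "2", "3").
g <- factor(c("3", "3", "2", NA), levels = c("2", "3"))

# Direct conversion yields the level codes: "2" -> 1, "3" -> 2.
codes <- as.numeric(g)

# Going through character first recovers the labels as numbers.
values <- as.numeric(as.character(g))

codes   # 2 2 1 NA
values  # 3 3 2 NA
```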
Ras*_*sen 11
case_when is now a pretty clean implementation of the SQL-style case when:
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame") -> df
df %>%
mutate( g = case_when(
a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 ) ~ 2,
a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
))
Using dplyr 0.7.4
Manual: http://dplyr.tidyverse.org/reference/case_when.html
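When no fallback formula is given, case_when returns NA for rows that match none of the conditions. A minimal sketch on the a, b and c columns of the question's data (the names dat and res are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Columns a, b and c from the question's example data.
dat <- data.frame(
  a = c(1, 3, 4, 6, 3, 2, 5, 1),
  b = c(1, 3, 4, 2, 6, 7, 2, 6),
  c = c(6, 3, 6, 5, 3, 6, 5, 3)
)

res <- dat %>%
  mutate(g = case_when(
    a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2,
    a %in% c(0, 1, 3, 4) | c == 4 ~ 3
    # no TRUE ~ ... fallback: unmatched rows become NA
  ))

# Row 4 (a == 6, c == 5) matches nothing and is NA.
# Row 5 has a == 3, so the stated conditions give 3 there.
res$g
```

Spelling out the fallback, as the other answers do with TRUE ~ NA_real_, makes the intent explicit but does not change the result here.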