dplyr对行的子集进行mutate/replace

Chr*_*ton 65 r dplyr data.table

我正在尝试基于dplyr的工作流程(而不是主要使用我习惯的data.table),而且我遇到了一个问题,我无法找到一个等效的dplyr解决方案.我经常遇到需要根据单个条件有条件地更新/替换多个列的场景.这是一些示例代码,我的data.table解决方案:

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

Run Code Online (Sandbox Code Playgroud)

是否有一个简单的dplyr解决方案来解决同样的问题？我想避免使用ifelse,因为我不想多次输入条件 - 这是一个简化的例子,但有时很多基于单个条件的赋值.

在此先感谢您的帮助!

这些解决方案(1)维护管道,(2)不覆盖输入,(3)只需要指定条件一次:

1a)mutate_cond为可以合并到管道中的数据帧或数据表创建一个简单的函数.此函数类似mutate但仅作用于满足条件的行:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

Run Code Online (Sandbox Code Playgroud)

1b)mutate_last这是数据帧或数据表的替代函数,它又类似mutate但仅在其中使用group_by(如下例所示),并且仅对最后一个组而不是每个组进行操作.请注意,TRUE> FALSE,因此如果group_by指定条件,则mutate_last仅对满足该条件的行进行操作.

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

Run Code Online (Sandbox Code Playgroud)

2)因子分解条件通过使条件成为额外的列而将其删除后将其排除.然后使用ifelse,replace或如图所示使用逻辑算术.这也适用于数据表.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

Run Code Online (Sandbox Code Playgroud)

3)sqldf我们可以update通过管道中的sqldf包使用SQL 来获取数据帧(但不是数据表,除非我们转换它们 - 这可能代表dplyr中的一个错误.请参阅dplyr issue 1579).可能看起来由于存在而不希望地修改该代码中的输入,update但实际上update是在临时生成的数据库中而不是在实际输入上作用于输入的副本.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

Run Code Online (Sandbox Code Playgroud)

注1:我们用它作为DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Run Code Online (Sandbox Code Playgroud)

注2:如何轻松地更新指定行的子集的问题也讨论了dplyr问题134,631,1518和1573与631是主线程和1573是这里的答案进行了审查.

已经过去几年了，github 问题似乎已经关闭并锁定了。这个问题官方有解决办法吗？ (3认同)

你可以用magrittr双向管道做到这一点%<>%:

library(dplyr)
library(magrittr)

dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                    cf = 0,  
                                    delta.watts = 13)

Run Code Online (Sandbox Code Playgroud)

这减少了打字的数量,但仍然慢得多data.table.

这是我喜欢的解决方案:

mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

Run Code Online (Sandbox Code Playgroud)

它可以让你写出像

mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)

Run Code Online (Sandbox Code Playgroud)

这是非常易读的 - 尽管它可能没有那么高效.

正如上面的eipi10所示,没有一种简单的方法可以在dplyr中进行子集替换,因为DT使用pass-by-reference语法与使用pass-by-value的dplyr.dplyr需要在ifelse()整个向量上使用,而DT将执行子集并通过引用更新(返回整个DT).因此,对于本练习,DT将大大加快.

您可以选择先进行子集,然后进行更新,最后重新组合:

dt.sub <- dt[dt$measure == "exit",] %>%
  mutate(qty.exit= qty, cf= 0, delta.watts= 13)

dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

Run Code Online (Sandbox Code Playgroud)

但DT会快得多:(编辑使用eipi10的新答案)

library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt= {dt <- dt[measure == 'exit', 
                            `:=`(qty.exit = qty,
                                 cf = 0,
                                 delta.watts = 13)]},
               eipi10= {dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                cf = 0,  
                                delta.watts = 13)},
               alex= {dt.sub <- dt[dt$measure == "exit",] %>%
                 mutate(qty.exit= qty, cf= 0, delta.watts= 13)

               dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])})


Unit: microseconds
expr      min        lq      mean   median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100  a 
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100   b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100   b

Run Code Online (Sandbox Code Playgroud)

我偶然发现了这个,非常喜欢mutate_cond()@G.格洛腾迪克,但认为处理新变量可能会派上用场.所以,下面有两个补充:

无关:倒数第二行做多一点dplyr使用filter()

开头的三个新行获取用于的变量名称mutate(),并在mutate()发生之前初始化数据框中的任何新变量.新的变量初始化为的剩余data.frame使用new_init,它被设置为缺失(NA)作为默认值.

mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Run Code Online (Sandbox Code Playgroud)

以下是使用虹膜数据的一些示例:

更改Petal.Length到88何处Species == "setosa".这将适用于原始功能以及此新版本.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Run Code Online (Sandbox Code Playgroud)

与上面相同,但也创建一个新变量x(NA不包括在条件中的行).以前不可能.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Run Code Online (Sandbox Code Playgroud)

与上面相同,但条件中未包含的行x设置为FALSE.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

Run Code Online (Sandbox Code Playgroud)

此示例显示如何new_init将a设置list为初始化具有不同值的多个新变量.在这里,两个新的变量与排除行创建使用不同的值被初始化(x初始化为FALSE,y如NA)

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))

Run Code Online (Sandbox Code Playgroud)

您可以拆分数据集并对部分进行常规的 mutate 调用TRUE。

\n

分割可以用或dplyr::group_split()来完成，我更喜欢这里的基本版本，因为它保留了名称，请参阅https://github.com/tidyverse/dplyr/issues/4223base::split()上的讨论。

\n

df1 <- data.frame(site = sample(1:6, 50, replace=T),\n                  space = sample(1:4, 50, replace=T),\n                  measure = sample(c(\'cfl\', \'led\', \'linear\', \'exit\'), 50, \n                                   replace=T),\n                  qty = round(runif(50) * 30),\n                  qty.exit = 0,\n                  delta.watts = sample(10.5:100.5, 50, replace=T),\n                  cf = runif(50),\n                  stringsAsFactors = F)\n

Run Code Online (Sandbox Code Playgroud)\n

library(tidyverse)\ndf1 %>%\n  group_split(measure == "exit", .keep = FALSE) %>% \n  modify_at(2, ~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%\n  bind_rows()\n#> # A tibble: 50 \xc3\x97 7\n#>     site space measure   qty qty.exit delta.watts    cf\n#>    <int> <int> <chr>   <dbl>    <dbl>       <dbl> <dbl>\n#>  1     5     1 linear     22        0       100.  0.126\n#>  2     3     3 led        12        0        61.5 0.161\n#>  3     6     1 led        26        0        25.5 0.307\n#>  4     5     2 cfl        16        0        26.5 0.865\n#>  5     6     3 linear     19        0        57.5 0.684\n#>  6     1     4 led        12        0        14.5 0.802\n#>  7     6     4 led         5        0        90.5 0.547\n#>  8     5     4 linear     28        0        54.5 0.171\n#>  9     1     2 linear      5        0        24.5 0.775\n#> 10     1     2 cfl        24        0        96.5 0.144\n#> # \xe2\x80\xa6 with 40 more rows\n

Run Code Online (Sandbox Code Playgroud)\n

df1 %>%\n  split(~measure == "exit") %>% \n  modify_at("TRUE", ~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%\n  bind_rows()\n#>    site space measure qty qty.exit delta.watts          cf\n#> 1     5     1  linear  22        0       100.5 0.125646491\n#> 2     3     3     led  12        0        61.5 0.160692291\n#> 3     6     1     led  26        0        25.5 0.307239765\n#> 4     5     2     cfl  16        0        26.5 0.864969074\n#> 5     6     3  linear  19        0        57.5 0.683945200\n#> 6     1     4     led  12        0        14.5 0.802398642\n#> 7     6     4     led   5        0        90.5 0.547211378\n#> 8     5     4  linear  28        0        54.5 0.170614207\n#> 9     1     2  linear   5        0        24.5 0.774603932\n#> 10    1     2     cfl  24        0        96.5 0.144310557\n#> 11    3     4  linear  21        0        93.5 0.682622390\n#> 12    4     4     led   2        0        48.5 0.941718646\n#> 13    4     4     cfl   2        0       100.5 0.918448627\n#> 14    5     2     led  11        0        63.5 0.998143780\n#> 15    4     1     led  21        0        53.5 0.644740176\n#> 16    1     3     cfl   5        0        28.5 0.110610285\n#> 17    1     3  linear  24        0        41.5 0.538868200\n#> 18    4     3     led  29        0        19.5 0.998474289\n#> 19    2     3     cfl   4        0        22.5 0.008167536\n#> 20    5     1     led  20        0        56.5 0.740833476\n#> 21    3     2     led   5        0        44.5 0.223967706\n#> 22    1     4     led  27        0        32.5 0.199850583\n#> 23    3     4     cfl  17        0        61.5 0.104023080\n#> 24    1     3     cfl  11        0        34.5 0.399036247\n#> 25    2     3  linear  29        0        65.5 0.600678235\n#> 26    2     4     cfl  23        0        29.5 0.291611352\n#> 27    6     2  linear  13        0        37.5 0.225021614\n#> 28    2     3     led  17        0        62.5 0.879606956\n#> 29    2     4     led  29        0        51.5 0.301759669\n#> 30    5     1     led  11        0        54.5 0.793816856\n#> 31    2     3     led  20        0        29.5 0.514759195\n#> 32    3     4  linear   6        0        68.5 0.475085443\n#> 33    1     4     led  21        0        34.5 0.133207588\n#> 34    2     4  linear  25        0        80.5 0.164279355\n#> 35    5     3     led   7        0        73.5 0.252937836\n#> 36    6     2     led  15        0        99.5 0.554864929\n#> 37    3     2  linear   6        0        44.5 0.377257874\n#> 38    4     4    exit  15       15        13.0 0.000000000\n#> 39    3     3    exit  10       10        13.0 0.000000000\n#> 40    5     1    exit  15       15        13.0 0.000000000\n#> 41    4     2    exit   1        1        13.0 0.000000000\n#> 42    5     3    exit  10       10        13.0 0.000000000\n#> 43    1     3    exit  14       14        13.0 0.000000000\n#> 44    5     2    exit  12       12        13.0 0.000000000\n#> 45    2     2    exit  30       30        13.0 0.000000000\n#> 46    6     3    exit  28       28        13.0 0.000000000\n#> 47    1     1    exit  14       14        13.0 0.000000000\n#> 48    3     3    exit  21       21        13.0 0.000000000\n#> 49    4     2    exit  13       13        13.0 0.000000000\n#> 50    4     3    exit  12       12        13.0 0.000000000\n

Run Code Online (Sandbox Code Playgroud)\n

^{由reprex 包于 2022 年 10 月 7 日创建(v2.0.1)}

\n

mutate_cond是一个很棒的函数，但是如果用于创建条件的列中不存在NA，则会产生错误。我觉得有条件的mutation应该只留下这样的行。这与filter（）的行为相匹配，当条件为TRUE时，filter（）返回行，但是省略了两行都为FALSE和NA。

有了这个小的改动，该功能就可以发挥出魅力：

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
    condition <- eval(substitute(condition), .data, envir)
    condition[is.na(condition)] = FALSE
    .data[condition, ] <- .data[condition, ] %>% mutate(...)
    .data
}

Run Code Online (Sandbox Code Playgroud)

我实际上没有看到任何改变dplyr会让这变得更容易。case_when当一列有多个不同的条件和结果时非常有用，但对于您想要根据一个条件更改多个列的情况没有帮助。同样，recode如果您要替换一列中的多个不同值，但同时替换多列中的多个不同值，则可以节省键入时间。最后mutate_at等仅将条件应用于列名而不是数据框中的行。您可以为 mutate_at 编写一个函数来实现这一点，但我不知道如何让它对不同的列表现不同。

这就是我将如何使用nestformtidyr和mapfrom来处理它purrr。

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

Run Code Online (Sandbox Code Playgroud)

一种简洁的解决方案是对过滤后的子集进行变异，然后添加回表的非退出行：

library(dplyr)

dt %>% 
    filter(measure == 'exit') %>%
    mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
    rbind(dt %>% filter(measure != 'exit'))

Run Code Online (Sandbox Code Playgroud)

归档时间：	9 年，9 月前
查看次数：	32771 次
最近记录：	6 年，4 月前

返回小标题：如何使用 case_when 进行矢量化？ 6

更多相关链接

如何覆盖包命名空间中的不可见函数？ 29

使用outer而不是expand.grid 21

R梦魇:Yosemite,R,RStudio和Homebrew 18

将时间序列转换为数据帧并返回 16

根据数据变量绘制颜色和形状的点 15

在R中查找列表元素 11

将空白插入载体中,例如R中的次要刻度标签 11

生存不承认正确的审查数据 11

使用Tidyr/Dplyr汇总字符串组的计数 5

过滤多列中的精确字符匹配 3

如果__name__ =="__ main__":怎么办？ 5545

使用CSS更改HTML5输入的占位符颜色 3876

如何制作一系列功能装饰器？ 2647

我是否施放了malloc的结果？ 2318

使用jQuery从下拉列表(选择框)中获取所选文本 2240

比较Git中的两个分支？ 2149

如何修改指定的提交？ 2077

为什么我需要一直做`--set-upstream`？ 1364

最优雅的方法来检查Python中的字符串是否为空？ 1282

如何获得最近提交的Git分支列表？ 1197