Recoding a bunch of variables in the tidyverse (functional / meta-programming)

Tim*_*Fan 4 r recode purrr tidyverse

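I want to recode a bunch of variables with as few function calls as possible. I have a data.frame in which I want to recode a number of variables. I created a named list containing all the variable names and the recoding arguments I want to apply. Up to here I have no problem using map and dplyr. When it comes to the recoding itself, however, I find it much easier to use recode from the car package than dplyr's own recode function. A side question is whether there is a nice way to do the same thing with dplyr::recode.

As a next step, I break the data.frame down into a nested tibble. Here I want to apply specific recodings in each subset. This is where things get complicated, and I can no longer do it in a dplyr pipe. The only thing I got working is a very ugly nested for loop.

I am looking for ideas on how to do this in a nice and clean way.

Let's start with the simple example: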

library(carData)
library(dplyr)
library(purrr)
library(tidyr)

# global recode list
recode_ls = list(

  mar = "'not married' = 0;
          'married' = 1",

  wexp = "'no' = 0;
          'yes' = 1"
)

recode_vars <- names(Rossi)[names(Rossi) %in% names(recode_ls)]

Rossi2 <- Rossi # let's save the results under a different name

# recode_ls[[.x]] extracts the recode string itself rather than a one-element list
Rossi2[, recode_vars] <- recode_vars %>% map(~ car::recode(Rossi[[.x]],
                                                           recode_ls[[.x]],
                                                           as.factor = FALSE,
                                                           as.numeric = TRUE))

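So far this seems pretty clean to me, apart from the fact that car::recode is much easier to use than dplyr::recode.

Here comes my actual question. What I want to do is recode (in this simple example) the variables mar and wexp differently in each tibble subset. In my real dataset there are many more variables to recode in each subset, and their names differ too. Does anyone have a good idea how to do this cleanly with a dplyr pipe and map?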

nested_rossi <- as_tibble(Rossi) %>% nest(-race)

recode_wexp_ls = list(

  no = list(
    mar = "'not married' = 0;
           'married' = 1",

    wexp = "'no' = 0;
            'yes' = 1"
  ),

  yes = list(
    mar = "'not married' = 1;
           'married' = 2",

    wexp = "'no' = 1;
            'yes' = 2"
  )
)

We could also attach the list to the nested data.frame, but I am not sure whether that would make things any more efficient.

nested_rossi$recode = list(

          no = list(

          mar = "'not married' = 0;
                 'married' = 1",

          wexp = "'no' = 0;
                  'yes' = 1"
          ),

          yes = list(
            mar = "'not married' = 1;
                   'married' = 2",

            wexp = "'no' = 1;
                    'yes' = 2"
          )
        )

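Thanks for the cool question! It is a great opportunity to use the full power of metaprogramming.

First, let's examine the recode() function. It takes a vector and any number of (named) arguments, and returns the same vector with the values replaced according to those arguments: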

x <- c("a", "b", "c")
recode(x, a = "Z", c = "X")

#> [1] "Z" "b" "X"

The help page for recode says that we can use unquote splicing (!!!) to pass a named list into it.

x_codes <- list(a = "Z", c = "X")
recode(x, !!!x_codes)

#> [1] "Z" "b" "X"
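
As a rough illustration of what the splice does, the sketch below checks that splicing the list gives the same result as typing the arguments by hand. The .default argument is part of dplyr::recode but is not used in the example above; it is included here only to show what happens to unmatched values.

```r
library(dplyr)

x <- c("a", "b", "c")
x_codes <- list(a = "Z", c = "X")

# Splicing the named list expands it into individual named arguments
spliced <- recode(x, !!!x_codes)
by_hand <- recode(x, a = "Z", c = "X")
identical(spliced, by_hand)

# Unmatched values ("b" here) are kept as they are unless .default is supplied
with_default <- recode(x, !!!x_codes, .default = "other")
with_default
```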

This ability can be used when mutating a data frame. Suppose we have a subset of the Rossi dataset:

library(carData)
library(tidyverse)

rossi <- Rossi %>% 
  as_tibble() %>% 
  select(mar, wexp)

To mutate two variables in a single function call, we can use this snippet (note that both the named-argument approach and the unquote-splicing approach work fine):

mar_codes <- list(`not married` = 0, married = 1)
wexp_codes <- list(no = 0, yes = 1)

rossi %>% 
  mutate(
    mar_code = recode(mar, "not married" = 0, "married" = 1),
    wexp_code = recode(wexp, !!!wexp_codes)
  )

#> # A tibble: 432 x 4
#>    mar         wexp  mar_code wexp_code
#>    <fct>       <fct>    <dbl>     <dbl>
#>  1 not married no           0         0
#>  2 not married no           0         0
#>  3 not married yes          0         1
#>  4 married     yes          1         1
#>  5 not married yes          0         1

So unquote splicing is a good way to pass multiple arguments to a function in a non-standard evaluation context.

Now suppose we have a list of lists of codes:

mapping <- list(mar = mar_codes, wexp = wexp_codes)
mapping

#> $mar
#> $mar$`not married`
#> [1] 0

#> $mar$married
#> [1] 1

#> $wexp
#> $wexp$no
#> [1] 0

#> $wexp$yes
#> [1] 1

What we need is to convert this list into a list of expressions to place inside mutate():

expressions <- mapping %>% 
  imap(
    ~ quo(
      recode(!!sym(.y), !!!.x)
    )
  )

expressions

#> $mar
#> <quosure>
#> expr: ^recode(mar, not married = 0, married = 1)
#> env:  0x7fbf374513c0

#> $wexp
#> <quosure>
#> expr: ^recode(wexp, no = 0, yes = 1)
#> env:  0x7fbf37453468
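
To see what one of these captured expressions does on its own, it can be evaluated against a data frame with rlang::eval_tidy(). This is only a sketch: the toy tibble stands in for the Rossi columns and is made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the mar and wexp columns of Rossi
toy <- tibble(
  mar  = factor(c("not married", "married", "not married")),
  wexp = factor(c("no", "yes", "yes"))
)

mapping <- list(
  mar  = list(`not married` = 0, married = 1),
  wexp = list(no = 0, yes = 1)
)

# One quosure per variable: recode(<variable>, <its codes>)
expressions <- imap(mapping, ~ quo(recode(!!sym(.y), !!!.x)))

# Evaluate a single quosure with the columns of toy in scope, outside mutate()
mar_recoded <- rlang::eval_tidy(expressions$mar, data = toy)
mar_recoded
```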

The last step: pass this list of expressions into mutate and see what it does:

mutate(rossi, !!!expressions)

#> # A tibble: 432 x 2
#>      mar  wexp
#>    <dbl> <dbl>
#>  1     0     0
#>  2     0     0
#>  3     0     1
#>  4     1     1
#>  5     0     1

Now you can extend the list of variables to recode, process several lists at once, and so on.
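
For comparison, the same recoding can be sketched without quosures by calling recode() through do.call(). This is an alternative, not the approach used above, and toy is a made-up stand-in for the Rossi columns.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the mar and wexp columns of Rossi
toy <- tibble(
  mar  = factor(c("not married", "married", "not married")),
  wexp = factor(c("no", "yes", "yes"))
)

mapping <- list(
  mar  = list(`not married` = 0, married = 1),
  wexp = list(no = 0, yes = 1)
)

# For each variable name (.y) and its codes (.x), build the argument list
# by hand and call recode() on the column directly
recoded_cols <- imap(mapping, ~ do.call(recode, c(list(toy[[.y]]), .x)))

# Overwrite the matching columns in a copy of the data
toy_recoded <- toy
toy_recoded[names(recoded_cols)] <- recoded_cols
toy_recoded
```

The quosure version keeps everything inside a single mutate() call, which is what makes it easy to reuse per group later on; the do.call() version trades that for not needing any quoting.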

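With a technique as powerful as metaprogramming you can do amazing things. I strongly recommend digging deeper into the topic. There is no better place to start than Hadley Wickham's Advanced R book.

I hope this is what you were looking for.

Update

Diving deeper. The question is: how do we apply this technique to a tibble-column?

Let's create a nested tibble with group and df (the data we want to recode):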

rossi <- 
  head(Rossi, 5) %>% 
  as_tibble() %>% 
  select(mar, wexp)

nested <- tibble(group = c("yes", "no"), df = list(rossi))

nested looks like this:

# A tibble: 2 x 2
  group df              
  <chr> <list>          
1 yes   <tibble [5 × 2]>
2 no    <tibble [5 × 2]>


We already know how to build a list of expressions from a list of codes. Let's create a function to handle that for us.

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}

Here, the list_of_codes argument is a named list with one set of codes for each variable that needs recoding.

Suppose we have a list of several recodings, codes. We can convert it into a list of lists of expressions. The number of variables in each list can be arbitrary.

codes <- list(
  yes = list(mar = list(`not married` = 0, married = 1)),
  no = list(
    mar = list(`not married` = 10, married = 20), 
    wexp = list(no = "NOOOO", yes = "YEEEES")
  )
)

exprs <- map(codes, build_recode_expressions)

Now we can easily add exprs to the nested data frame as a new list-column.

There is one more function that may be useful for further work. It takes a data frame and a list of quoted expressions, and returns a new data frame with the recoded columns.

recode_df <- function(df, exprs) mutate(df, !!!exprs)

Time to put everything together. We have the tibble-column df, the list-column exprs, and the recode_df function that binds them together one by one.

The key is the map2 function, which lets us iterate over two lists at the same time.
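
The map2 step described above might look like the sketch below. It is a reconstruction under assumptions, not the original snippet: the toy rossi tibble copies the five rows shown earlier, the expressions are stored under the hypothetical name recode_exprs and matched to rows by group name, and only the recoded columns are kept before unnesting.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for head(Rossi, 5) %>% select(mar, wexp)
rossi <- tibble(
  mar  = factor(c("not married", "not married", "not married",
                  "married", "not married")),
  wexp = factor(c("no", "no", "yes", "yes", "yes"))
)

nested <- tibble(group = c("yes", "no"), df = list(rossi, rossi))

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}

recode_df <- function(df, exprs) mutate(df, !!!exprs)

codes <- list(
  yes = list(mar = list(`not married` = 0, married = 1)),
  no  = list(
    mar  = list(`not married` = 10, married = 20),
    wexp = list(no = "NOOOO", yes = "YEEEES")
  )
)

# recode_exprs is a hypothetical name, chosen to avoid clashing with the column
recode_exprs <- map(codes, build_recode_expressions)

result <- nested %>%
  mutate(exprs = recode_exprs[group]) %>%              # look up expressions by group name
  mutate(df_recoded = map2(df, exprs, recode_df)) %>%  # recode each nested data frame
  select(group, df_recoded) %>%
  unnest(df_recoded)

result
```

This sketch keeps only the recoded columns. The output shown below additionally carries the original mar and wexp next to the recoded ones (as mar1 and wexp1).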

Here is the output:

# A tibble: 10 x 5
   group mar         wexp   mar1 wexp1 
   <chr> <fct>       <fct> <dbl> <chr> 
 1 yes   not married no        0 no    
 2 yes   not married no        0 no    
 3 yes   not married yes       0 yes   
 4 yes   married     yes       1 yes   
 5 yes   not married yes       0 yes   
 6 no    not married no       10 NOOOO 
 7 no    not married no       10 NOOOO 
 8 no    not married yes      10 YEEEES
 9 no    married     yes      20 YEEEES
10 no    not married yes      10 YEEEES

I hope this update solves your problem.

