Dummy code categorical / ordinal variables in tidyverse r

Question

Jac*_*son 6 r dummy-variable purrr tidyverse

Suppose I have a tibble.

library(tidyverse)
tib <- as_tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib

# A tibble: 10 x 3
   record gender like_product
    <int> <fctr>       <fctr>
 1      1      F            2
 2      2      M            1
 3      3      M            2
 4      4      F            3
 5      5      F            4
 6      6      M            2
 7      7      F            4
 8      8      M            4
 9      9      F            4
10     10      M            5

I want to dummy code my data with 1s and 0s so that the data looks more or less like this.

# A tibble: 10 x 8
   record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
    <int>    <dbl>    <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
 1      1        0        1              0              0              1              0              0
 2      2        0        1              0              0              0              0              0
 3      3        0        1              0              1              0              0              0
 4      4        0        1              1              0              0              0              0
 5      5        1        0              0              0              0              0              0
 6      6        0        1              0              0              0              0              0
 7      7        0        1              0              0              0              0              0
 8      8        0        1              0              1              0              0              0
 9      9        1        0              0              0              0              0              0
10     10        1        0              0              0              0              0              1

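My workflow requires that I know a range of variables to dummy code (i.e. gender:like_product), but I don't want to identify every variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.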

I know of several ways of doing this, but none of them fits perfectly within the tidyverse. I know I could use mutate...

tib %>%
     mutate(gender_M = ifelse(gender == "M", 1, 0), 
            gender_F = ifelse(gender == "F", 1, 0), 
            like_product_1 = ifelse(like_product == 1, 1, 0), 
            like_product_2 = ifelse(like_product == 2, 1, 0), 
            like_product_3 = ifelse(like_product == 3, 1, 0), 
            like_product_4 = ifelse(like_product == 4, 1, 0), 
            like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
     select(-gender, -like_product)

But this breaks my workflow rule, because I would have to specify every dummy coded output by hand.

I've done this in the past with model.matrix, from the stats package.

model.matrix(~ gender + like_product, tib)

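Easy and straightforward, but I want a solution that's in the tidyverse. EDIT: The reason is that I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.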
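For reference, model.matrix can be combined with a dplyr range selection so that no variable or level is listed by hand. The following is only a sketch under assumptions: the data are copied from the tibble printed above, the names `factors` and `full_contrasts` are made up for illustration, and every selected column is assumed to be a factor. Passing `contrasts = FALSE` through `contrasts.arg` keeps all levels instead of dropping a reference level.

```r
library(dplyr)
library(tibble)

# Data copied from the tibble printed in the question (not re-sampled).
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

# Select the factor columns as a range instead of naming each one.
factors <- select(tib, gender:like_product)

# contrasts = FALSE asks for one indicator column per level.
full_contrasts <- lapply(factors, contrasts, contrasts = FALSE)

# "+ 0" drops the intercept column.
dummies <- model.matrix(~ . + 0, data = factors, contrasts.arg = full_contrasts)

# Re-attach the untouched record column.
result <- bind_cols(select(tib, record), as_tibble(dummies))
result
```

The resulting column names follow model.matrix conventions (genderF, like_product1, and so on) rather than the gender_F style shown in the desired output.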

I think the solution is in purrr

library(purrr)
# Build one 0/1 column per level of the factor x
dummy_code <- function(x) {
     lvls <- levels(x)
     sapply(lvls, function(y) as.integer(x == y)) %>% as_tibble()
}

# Apply dummy_code only to the named columns
tib %>%
     map_at(c("gender", "like_product"), dummy_code)

$record
 [1]  1  2  3  4  5  6  7  8  9 10

$gender
# A tibble: 10 x 2
       F     M
   <int> <int>
 1     1     0
 2     0     1
 3     0     1
 4     1     0
 5     1     0
 6     0     1
 7     1     0
 8     0     1
 9     1     0
10     0     1

$like_product
# A tibble: 10 x 5
     `1`   `2`   `3`   `4`   `5`
   <int> <int> <int> <int> <int>
 1     0     1     0     0     0
 2     1     0     0     0     0
 3     0     1     0     0     0
 4     0     0     1     0     0
 5     0     0     0     1     0
 6     0     1     0     0     0
 7     0     0     0     1     0
 8     0     0     0     1     0
 9     0     0     0     1     0
10     0     0     0     0     1

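This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into one tibble. Additionally, I still have to specify every column, and overall it seems clunky.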
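One possible way to get from the list back to a single tibble, sketched under assumptions: the data are copied from the tibble printed at the top, and this version of `dummy_code()` takes a second, hypothetical `name` argument so that each output column is prefixed with its source column. `imap()` supplies that name, and `bind_cols()` glues the pieces together.

```r
library(dplyr)
library(purrr)
library(tibble)

# Data copied from the tibble printed in the question (not re-sampled).
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

# Hypothetical helper: one 0/1 column per level, named <column>_<level>.
dummy_code <- function(x, name) {
  lvls <- levels(x)
  cols <- map(lvls, function(lvl) as.integer(x == lvl))
  names(cols) <- paste(name, lvls, sep = "_")
  as_tibble(cols)
}

dummies <- tib %>%
  select(gender:like_product) %>%  # range selection, no per-column listing
  imap(dummy_code) %>%             # imap() passes each column and its name
  unname() %>%                     # drop list names before binding
  bind_cols()

# Re-attach the untouched record column.
result <- bind_cols(select(tib, record), dummies)
result
```

Because the columns are chosen with `select()`, any tidyselect helper could stand in for the `gender:like_product` range.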

Any better ideas? Thanks!!

Answer 1

phi*_*ver 9

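An alternative to model.matrix is the recipes package. It is still a work in progress and is not yet included in the tidyverse. At some point it might be included in the tidyverse packages.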

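I will leave it to you to read up on recipes, but in the step step_dummy you can use special selectors from the tidyselect package (installed with recipes), just like the selectors you can use in dplyr, such as starts_with(). I created a little example to show the steps.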

Example code below.

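Whether this is handier is up to you, since, as already pointed out in the comments, the function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and, of course, in the internal checks that are done in the underlying code of all the separate steps.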

library(recipes)
library(tibble)

tib <- as_tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))

# Define the recipe, add the dummy step, estimate it, then apply it
dum <- tib %>%
  recipe(~ .) %>%
  step_dummy(gender, like_product) %>%
  prep(training = tib) %>%
  bake(new_data = tib)  # this argument was called newdata in early recipes releases

dum

dum

# A tibble: 10 x 6
   record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
    <int>    <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
 1      1       1.              1.              0.              0.              0.
 2      2       1.              1.              0.              0.              0.
 3      3       1.              1.              0.              0.              0.
 4      4       0.              0.              1.              0.              0.
 5      5       0.              0.              0.              0.              0.
 6      6       0.              1.              0.              0.              0.
 7      7       0.              1.              0.              0.              0.
 8      8       0.              0.              0.              1.              0.
 9      9       0.              0.              0.              0.              1.
10     10       1.              0.              0.              0.              0.
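The output above drops one reference level per factor (there is no gender_F or like_product_1 column), while the question asked for a column per level. A minimal sketch of one way to get that, assuming a recipes version that provides the `one_hot` argument of `step_dummy()` and the `all_nominal()` selector; the data are copied from the tibble printed in the question, and `dum_all` is just an illustrative name.

```r
library(recipes)
library(tibble)

# Data copied from the tibble printed in the question (not re-sampled).
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

dum_all <- recipe(~ ., data = tib) %>%
  # all_nominal() picks every factor column; one_hot = TRUE keeps all levels.
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep(training = tib) %>%
  bake(new_data = tib)

dum_all
```

Using a selector here means no column has to be named individually, which is the workflow constraint described in the question.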

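This is exactly what I was looking for. It's intuitive, preserves good names, and uses tidy selectors. I'll have to read more about the "recipes" package. Thanks! (2 upvotes)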
