Retaining attributes when using gather from tidyr (attributes are not identical)

jos*_*kre 12 r tidyr

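I have a data frame that I need to split into two tables to satisfy Codd's third normal form. In a simple case, the original data frame looks something like this: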

library(lubridate)
> (df <- data.frame(hh_id = 1:2,
                   income = c(55000, 94000),
                   bday_01 = ymd(c(20150309, 19890211)),
                   bday_02 = ymd(c(19850911, 20000815)),
                   gender_01 = factor(c("M", "F")),
                   gender_02 = factor(c("F", "F"))))

    hh_id income    bday_01    bday_02 gender_01 gender_02
  1     1  55000 2015-03-09 1985-09-11         M         F
  2     2  94000 1989-02-11 2000-08-15         F         F

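When I use the gather function, it warns that the attributes are not identical, and the result loses the factor class of gender and the lubridate date class of bday (or other attributes in my real-world example). Is there a nice tidyr solution that avoids losing each column's data type?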

library(dplyr)  # provides select() and %>%
library(tidyr)
> (person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      gather(key, value, -hh_id) %>%
      separate(key, c("key", "per_num"), sep = "_") %>%
      spread(key, value))

     hh_id per_num       bday gender
   1     1      01 1425859200      M
   2     1      02  495244800      F
   3     2      01  603158400      F
   4     2      02  966297600      F

   Warning message:
   attributes are not identical across measure variables; they will be dropped

> lapply(person, class)

  $hh_id
  [1] "integer"

  $per_num
  [1] "character"

  $bday
  [1] "character"

  $gender
  [1] "character"

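I can imagine doing it by gathering each set of variables with the same data type separately and then joining all the tables, but there must be a more elegant solution that I'm missing.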

Mat*_*rde 15

You could convert the dates to character and then convert them back to dates at the end:

(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      mutate_each(funs(as.character), contains('bday')) %>%
      gather(key, value, -hh_id) %>%
      separate(key, c("key", "per_num"), sep = "_") %>%
      spread(key, value) %>%
      mutate(bday=ymd(bday)))

  hh_id per_num       bday gender
1     1      01 2015-03-09      M
2     1      02 1985-09-11      F
3     2      01 1989-02-11      F
4     2      02 2000-08-15      F

Alternatively, if you use Date instead of POSIXct, you could do something like this:

library(stringr)  # provides str_extract()
(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
      gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
      mutate(bday=as.Date(bday)) %>%
      mutate_each(funs(str_extract(., '\\d+')), per_num1, per_num2) %>%
      filter(per_num1 == per_num2) %>%
      rename(per_num=per_num1) %>%
      select(-per_num2))
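For readers on a newer tidyr (1.0.0 or later, which is an assumption about your setup and not something this answer originally relied on), the same reshaping can probably be done with pivot_longer() rather than gather(). This is only a sketch of a different technique: the special ".value" entry in names_to tells pivot_longer() to keep bday and gender as separate output columns, so the dates and the gender codes are never stacked into one shared column. The data frame is rebuilt here with as.Date() so the block does not depend on lubridate.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question, built with base as.Date()
df <- data.frame(hh_id = 1:2,
                 income = c(55000, 94000),
                 bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
                 bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
                 gender_01 = factor(c("M", "F")),
                 gender_02 = factor(c("F", "F")))

person <- df %>%
  select(hh_id, bday_01:gender_02) %>%
  # Split names such as "bday_01" at "_": the first piece names the
  # output column (".value"), the second piece goes into per_num
  pivot_longer(cols = -hh_id,
               names_to = c(".value", "per_num"),
               names_sep = "_")

sapply(person, class)
```

Because each type gets its own output column, bday should remain a Date without the round trip through character.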

Edit

The warning you are seeing:

Warning: attributes are not identical across measure variables; they will be dropped

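comes from gathering the gender columns, which are factors with different level vectors (see str(df)). If you were to convert the gender columns to character, or if you were to synchronize their levels with something like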

df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))

then you will see that the warning goes away when you execute

person <- df %>% 
        select(hh_id, bday_01:gender_02) %>% 
        gather(key, value, contains('gender'))
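To make the mismatch concrete, here is a small base-R check (a sketch using only the two gender columns from the question's data) that compares the factor attributes before and after synchronizing the levels:

```r
# Only the two gender columns from the question's example
df <- data.frame(gender_01 = factor(c("M", "F")),
                 gender_02 = factor(c("F", "F")))

# The level vectors differ, so the attributes are not identical
levels(df$gender_01)
levels(df$gender_02)
before <- identical(attributes(df$gender_01), attributes(df$gender_02))

# Give gender_02 the same levels as gender_01
df$gender_02 <- factor(df$gender_02, levels = levels(df$gender_01))
after <- identical(attributes(df$gender_01), attributes(df$gender_02))

c(before = before, after = after)
```

Once the two columns share the same levels, their attributes match, which is the condition gather() checks before deciding whether to drop them.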