Handling factor variables in dplyr

tcq*_*inn 10 r dplyr

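I have a data frame containing an event history, and I want to check its integrity by verifying that the last event for each ID number matches the current value recorded in the system for that ID. The data are encoded as factors. The following toy data frame is a minimal example: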

df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
                 current.grade=as.factor(c("Senior","Senior","Senior","Senior",
                                         "Junior","Junior","Junior",
                                         "Sophomore","Sophomore")),
                 grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
                                   "Freshman","Sophomore","Junior",
                                   "Freshman","Sophomore")))

This gives the output:

> df
  ID current.grade grade.history
1  1        Senior      Freshman
2  1        Senior     Sophomore
3  1        Senior        Junior
4  1        Senior        Senior
5  2        Junior      Freshman
6  2        Junior     Sophomore
7  2        Junior        Junior
8  3     Sophomore      Freshman
9  3     Sophomore     Sophomore
> str(df)
'data.frame':   9 obs. of  3 variables:
 $ ID           : num  1 1 1 1 2 2 2 3 3
 $ current.grade: Factor w/ 3 levels "Junior","Senior",..: 2 2 2 2 1 1 1 3 3
 $ grade.history: Factor w/ 4 levels "Freshman","Junior",..: 1 4 2 3 1 4 2 1 4

I want to use dplyr to extract the last value of grade.history for each ID and check it against current.grade:

df.summary <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last=last(current.grade),
            grade.history.last=last(grade.history))

However, dplyr seems to convert the factors to integers, so I get this:

> df.summary
Source: local data frame [3 x 3]

  ID current.grade.last grade.history.last
1  1                  2                  3
2  2                  1                  2
3  3                  3                  4
> str(df.summary)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  3 variables:
 $ ID                : num  1 2 3
 $ current.grade.last: int  2 1 3
 $ grade.history.last: int  3 2 4

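Note that the values do not line up, because the two original factors have different sets of levels. Is there a way to do this in dplyr?

I am using R version 3.1.1 and dplyr version 0.3.0.2.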
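
The mismatch described above can be reproduced in base R without dplyr. The following is a minimal sketch; the variable names (current, history, senior_code_current, and so on) are made up for illustration and do not appear in the question. It shows that the same label gets a different integer code in each factor, because each factor has its own level set.

```r
# Two factors that share labels but not level sets (illustrative names)
current <- factor(c("Senior", "Junior", "Sophomore"))
history <- factor(c("Freshman", "Sophomore", "Junior", "Senior"))

# By default, levels are the sorted unique values of each factor
levels(current)
levels(history)

# Integer code stored for "Senior" in each factor
senior_code_current <- as.integer(current[1])
senior_code_history <- as.integer(history[4])

# The codes differ even though the labels are identical
codes_equal  <- senior_code_current == senior_code_history
labels_equal <- as.character(current[1]) == as.character(history[4])
```

Comparing the raw integer codes is therefore only meaningful when both factors were created with the same levels in the same order.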

luk*_*keA 0

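I think this comes down to the nature of factor objects in R: a factor is a vector of integer codes with a "levels" attribute of mode character. One way to solve the problem is to wrap the factor variables in as.character: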

library(dplyr)  # provides %>%, group_by(), summarize(), last()

df.summary <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last=last(as.character(current.grade)),
            grade.history.last=last(as.character(grade.history)))
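
Building on this, the integrity check from the question could be finished by comparing the two character columns. This is only a sketch: the matches column and the df.check name are introduced here for illustration, and newer dplyr releases may handle factors inside summarize() differently from version 0.3.0.2, so the as.character() step may not be strictly necessary there.

```r
library(dplyr)

# Toy data from the question
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  current.grade = factor(c("Senior", "Senior", "Senior", "Senior",
                           "Junior", "Junior", "Junior",
                           "Sophomore", "Sophomore")),
  grade.history = factor(c("Freshman", "Sophomore", "Junior", "Senior",
                           "Freshman", "Sophomore", "Junior",
                           "Freshman", "Sophomore"))
)

# Take the last value per ID as character, then compare the labels
df.check <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last = last(as.character(current.grade)),
            grade.history.last = last(as.character(grade.history))) %>%
  mutate(matches = current.grade.last == grade.history.last)  # hypothetical column name
```

Any row where matches is FALSE marks an ID whose last recorded event disagrees with its current value.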