Cleaning up factor levels (collapsing multiple levels/labels)

Ric*_*rta 53 r factors r-faq

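What is the most effective (i.e. efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how do you combine two or more factor levels into one?

Here is an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":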

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option is of course to clean the strings beforehand using sub and friends.
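
As a minimal sketch of that string-cleaning route, assuming exact matches on "Y" and "N" are all that is needed (the names `cleaned` and `x_sub` are illustrative and not part of the question):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Anchored patterns so that "Yes" and "No" themselves are left untouched
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Anything that is still not one of the two target values becomes NA
cleaned[!cleaned %in% c("Yes", "No")] <- NA

# Build the factor only after the strings are clean
x_sub <- factor(cleaned, levels = c("Yes", "No"))
x_sub
```

This works, but each extra spelling needs its own `sub()` call, which is part of why the question asks for something more direct.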

Another option is to allow duplicate labels and then drop the unused levels:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 

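However, is there a more effective way?


While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what would happen. Needless to say, none of the following got me any closer to my goal.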

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Aar*_*ica 77

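Use the levels function and assign it a named list, where the names are the desired level names and the elements are the current level names that should be renamed.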

x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No

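This is as described in the documentation for levels; also see the examples there.

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.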
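
To make the named-list form reusable, it can be wrapped in a small helper. This is only a sketch: `collapse_levels()` is a hypothetical name, not a base R function, and it relies on the behaviour shown above, where any level not mentioned in the list ends up as NA.

```r
# Hypothetical helper: collapse levels according to a named list
# (names = new levels, elements = old levels to fold into them)
collapse_levels <- function(f, mapping) {
  f <- factor(f)         # accepts character vectors as well as factors
  levels(f) <- mapping   # the 'factor' method of `levels<-` takes a named list
  f
}

y <- collapse_levels(
  c("Y", "Yes", "N", "No", "H"),
  list(Yes = c("Y", "Yes"), No = c("N", "No"))
)
y
```

The resulting levels follow the order of the names in the list, so here "Yes" comes before "No".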

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673 ; the levels<- sorcery is explained here: /sf/answers/734431701/.

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

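  • Agreed that it is odd that this can't be done inside `factor`; I don't know of a more direct way, other than using something like Ananda's solution or something with match. (2 upvotes)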

Uwe*_*Uwe 18

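As the question is titled Cleaning up factor levels (collapsing multiple levels/labels), the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

It provides several convenience functions for cleaning up factor levels: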

x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)

Collapse factor levels into manually defined groups

fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Change factor levels by hand

fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Automatically relabel factor levels, collapsing as necessary

fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

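Note that fct_relabel() works on factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is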

[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

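Here the levels are ordered as they appear in x, which differs from the default (see ?factor: the levels of a factor are by default sorted).

To match the expected output, use fct_inorder() before collapsing the levels: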

fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")

Both now return the expected output, with the levels in the same order.
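
For comparison, the same "order of first appearance" idea can be sketched in base R by passing `unique()` of the values as the levels. The vector `recoded` below is an assumed stand-in for values that have already been recoded; it is not taken from the answer.

```r
# Values after recoding (assumed for illustration)
recoded <- c("Yes", "Yes", "Yes", "No", "No", NA)

# Default: levels are sorted
f_default <- factor(recoded)
levels(f_default)

# Levels in order of first appearance, with NA excluded from the levels
f_inorder <- factor(recoded, levels = unique(recoded[!is.na(recoded)]))
levels(f_inorder)
```

The first call gives the levels in sorted order ("No" before "Yes"), while the second keeps "Yes" first because it appears first in the data.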


tim*_*tim 9

As of R 3.5.0 (2018-04-23) you can do this in one clear and simple line:

x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

"1 line, maps multiple values to the same level, sets NA for missing levels" (h/t @Aaron)
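
When there are many spellings to fold together, the levels and labels vectors can be derived from a single named list rather than typed out in parallel. This is a sketch that assumes R >= 3.5.0, where duplicated labels are permitted as described above; the name `mapping` is illustrative.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# names = target levels, elements = spellings to fold into them
mapping <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

tmp2 <- factor(
  x,
  levels = unlist(mapping, use.names = FALSE),    # "Y" "Yes" "N" "No"
  labels = rep(names(mapping), lengths(mapping))  # "Yes" "Yes" "No" "No"
)
tmp2
```

Values that do not appear in `mapping`, such as "H", are not among the levels and therefore become NA.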


A5C*_*2T1 8

Perhaps a named vector used as a lookup key might be of use:

> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes

This looks very similar to your last attempt... but this one works :-)
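
A small variation on the lookup, offered as a sketch: indexing a named vector by a name it does not contain returns NA, so the `H = NA` entry can be left out, and an explicit `levels` argument fixes the level order. The names `key` and `f_key` are illustrative.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Lookup key: names are the raw values, elements are the cleaned values
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")

# "H" is not a name in `key`, so key["H"] is NA
f_key <- factor(unname(key[x]), levels = c("Yes", "No"))
f_key
```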


Fra*_*ank 5

Another way is to make a table containing the mapping:

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.
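
One thing such a mapping table makes easy to inspect is which raw values it does not cover. A base R sketch, where `unmapped` is an illustrative name:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Mapping table: column `values` holds raw values, `ind` the target level
fmap <- stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

# Raw values with no entry in the table; these come out as NA after the lookup
unmapped <- setdiff(unique(x), fmap$values)
unmapped
```

Checking `unmapped` before recoding helps catch unexpected spellings that would otherwise silently turn into NA.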


Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes