Combining low-frequency counts

F. *_*. R 4 r categorical-data

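I am trying to collapse a nominal categorical vector by combining low-frequency counts into an "Other" category:

The data (a column of a data frame) looks like this and contains information for all 50 states: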

California
Florida
Alabama
...

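table(colname)/length(colname) correctly returns the frequencies. What I want to do is lump together everything below a given threshold (say f = 0.02). What is the right way to do this?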

A5C*_*2T1 8

From the sound of it, something like the following should work for you:

condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

Try it out:

## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h 
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
#    i    j 
# 0.08 0.07 

a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"

condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
# [57] "e"     "c"     "Other" "b"   

Note, however, that if you are working with factors, you should convert them with as.character first.
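The conversion matters because assigning a value that is not an existing level to a factor produces NA (with a warning) rather than a new category. A minimal sketch of a variant that handles both cases is shown here; the name condenseAny is hypothetical and not part of the answer above:

```r
# Hypothetical helper (not from the original answer): condense rare values
# in either a character vector or a factor.
condenseAny <- function(x, threshold = 0.02, newName = "Other") {
  wasFactor <- is.factor(x)
  values <- as.character(x)                  # factors become plain strings
  freq <- prop.table(table(values))          # relative frequency per value
  rare <- names(freq)[freq < threshold]      # values below the threshold
  values[values %in% rare] <- newName
  if (wasFactor) factor(values) else values  # rebuild the factor if needed
}

# Example: "z" makes up 1% of the data and is folded into "Other"
f <- factor(c(rep("x", 60), rep("y", 39), "z"))
table(condenseAny(f))
```

For a data frame column, the result can be assigned back in the usual way, e.g. df$state <- condenseAny(df$state), where df and state are placeholder names.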


Uwe*_*Uwe 5

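Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function, fct_lump(), which lumps together levels of a factor according to different criteria.

The OP's requirement of lumping together the levels below a threshold of 0.02 can be achieved with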

set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
 [1] c     d     d     e     j     h     c     h     g     i     g     d    
[13] f     Other g     h     h     a     b     h     e     g     h     b    
[25] d     e     e     g     i     f     d     e     g     c     g     a    
[37] Other i     i     b     i     j     f     d     c     h     Other j    
[49] j     c     Other e     f     a     a     h     e     c     Other b    
Levels: a b c d e f g h i j Other

Note that the sample data from the answer above has been used for comparison.


The function offers more possibilities. For example, it can keep the 5 least frequent factor levels and lump all the others together:

forcats::fct_lump(a, n = -5)
 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
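As far as I know, more recent forcats releases (0.5.0 and later) also provide more specific variants of fct_lump(), one per criterion. The following is only a sketch, assuming such a version is installed, of how the two calls above might be written with them:

```r
library(forcats)

set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

# Lump the levels that fall below the 2% threshold
fct_lump_prop(a, prop = 0.02)

# Keep the 5 least frequent levels (negative n) and lump the rest
fct_lump_n(a, n = -5)

# Lump the levels that occur fewer than 3 times
fct_lump_min(a, min = 3)
```

The exact values in a depend on the R version, because the default behaviour of sample() changed in R 3.6.0, but "A" to "E" occur once each in any case.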


F. *_*. R 1

This seems to work, but it is ugly. Is there a more elegant solution?

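collapsecatetgory <- function(x, p) {
  # Add 'Other' as an extra level so it can be assigned below
  levels_len <- length(levels(x))
  levels(x)[levels_len + 1] <- 'Other'
  # Relative frequency of each level
  y <- table(x) / length(x)
  y1 <- as.vector(y)
  y2 <- names(y)

  # Relabel every level at or below the threshold as 'Other'
  for (i in seq_along(y2)) {
    if (y1[i] <= p) {
      x[x == y2[i]] <- 'Other'
    }
  }
  # Drop the levels that are now empty
  droplevels(x)
}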
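One way to avoid the loop is to rename the rare levels directly: assigning the same name to several levels of a factor merges them. This is a minimal sketch of that idea, not a drop-in replacement; the name collapse_rare and the other argument are hypothetical:

```r
# Hypothetical name; same idea as the function above, without the loop.
collapse_rare <- function(x, p, other = "Other") {
  x <- droplevels(as.factor(x))            # ignore levels that never occur
  freq <- prop.table(table(x))             # share of each level
  rare <- names(freq)[freq <= p]           # levels at or below the threshold
  levels(x)[levels(x) %in% rare] <- other  # duplicate names merge the levels
  x
}

# Example: "z" makes up 1% of the data and becomes "Other"
x <- factor(c(rep("x", 60), rep("y", 39), "z"))
table(collapse_rare(x, 0.02))
```

Unlike the loop version, the merged level takes the position of the first rare level in the level order rather than always coming last.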