Combining low-frequency counts

Question

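I am trying to collapse a nominal categorical vector by combining low-frequency counts into an "Other" category.

The data (a column of a data frame) looks like this, with entries for all 50 states: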

California
Florida
Alabama
...

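table(colname)/length(colname) correctly returns the frequencies. What I want to do is lump together anything below a given threshold (say f = 0.02). What is the right way to do this?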

Answer 1

A5C*_*2T1 8

From the sound of it, something like the following should work for you:

condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

Try it out:

## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h 
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
#    i    j 
# 0.08 0.07 

a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"

condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
# [57] "e"     "c"     "Other" "b"

Note, however, that if you are working with factors, you should convert them with as.character first.
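
The reason for the conversion is that assigning a value that is not an existing level to a factor yields NA with a warning. A minimal sketch of the round trip, repeating the function so the block runs on its own; the state counts below are made-up sample data, not the asker's:

```r
# condenseMe() as defined in this answer, repeated so the block is self-contained
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

# Hypothetical sample data: a factor in which two levels each make up 1% of the values
f <- factor(c(rep("California", 60), rep("Florida", 38), "Alabama", "Wyoming"))

# Convert to character, condense, then rebuild the factor
condensed <- factor(condenseMe(as.character(f)))

levels(condensed)
table(condensed)
```

Rebuilding with factor() drops the levels that were folded into "Other", so no droplevels() call is needed afterwards.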

Answer 2

Uwe*_*Uwe 5

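Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function, fct_lump(), which lumps together levels of a factor according to different criteria.

The OP's request to lump together the levels below a threshold of 0.02 can be achieved with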

set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)

 [1] c     d     d     e     j     h     c     h     g     i     g     d    
[13] f     Other g     h     h     a     b     h     e     g     h     b    
[25] d     e     e     g     i     f     d     e     g     c     g     a    
[37] Other i     i     b     i     j     f     d     c     h     Other j    
[49] j     c     Other e     f     a     a     h     e     c     Other b    
Levels: a b c d e f g h i j Other

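Note that the sample data from Answer 1 is reused here for comparison.

The function offers more possibilities. For example, it can keep the 5 least frequent factor levels and lump all the others together: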

forcats::fct_lump(a, n = -5)

 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
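
As a further sketch, and assuming a recent forcats release that exports fct_lump_prop() (the more specific sibling of fct_lump()), the proportion threshold and the name of the lumped level can be set explicitly. The state counts and the label "Rare states" are made up for illustration:

```r
library(forcats)

# Hypothetical sample data: 100 values, four states at 1% each
states <- c(rep("California", 50), rep("Florida", 46),
            "Alabama", "Wyoming", "Vermont", "Alaska")

# Lump every level whose share does not exceed 2% into a custom-named level
lumped <- fct_lump_prop(states, prop = 0.02, other_level = "Rare states")

table(lumped)
```

Like fct_lump(), this accepts a character vector and returns a factor, with the lumped level placed last.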

Answer 3

F. *_*. R 1

This seems to work, but it's ugly. Is there a more elegant solution?

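collapsecatetgory <- function(x, p) {
  # Add 'Other' as a valid level so it can be assigned below
  levels_len <- length(levels(x))
  levels(x)[levels_len + 1] <- 'Other'

  # Relative frequency of each level
  y <- table(x) / length(x)
  y1 <- as.vector(y)
  y2 <- names(y)

  # Reassign every value whose level has a frequency at or below p
  for (i in seq_along(y2)) {
    if (y1[i] <= p) {
      x[x == y2[i]] <- 'Other'
    }
  }

  # Drop the levels that are now empty
  droplevels(x)
}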
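
One possible base R alternative without the loop, offered as a sketch rather than a definitive answer: assigning the same name to several levels merges them. The function name collapse_levels and the sample data are hypothetical; the `<=` comparison is kept from the function above.

```r
# Hypothetical helper: merge all levels with relative frequency <= p into one level
collapse_levels <- function(x, p, other = "Other") {
  x <- factor(x)                       # also accepts character vectors
  freq <- prop.table(table(x))         # relative frequency of each level
  rare <- names(freq)[freq <= p]       # levels at or below the threshold
  # Renaming several levels to the same name merges them into a single level
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Made-up sample data: "c" and "d" each account for 5% of the values
x <- factor(c(rep("a", 10), rep("b", 8), "c", "d"))
collapsed <- collapse_levels(x, p = 0.1)

table(collapsed)
```

Because the merge happens on the levels rather than on the values, no droplevels() call is needed afterwards.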
