Tags: r, categorical-data
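I am trying to collapse a nominal categorical vector by combining the low-frequency values into an "Other" category.
The data (a column of a data frame) look like this and contain information for all 50 states: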
California
Florida
Alabama
...
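table(colname)/length(colname) correctly returns the frequencies. What I want to do is lump together anything below a given threshold (say f = 0.02). What is the correct way to do this?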
From the sound of it, something like the following should work for you:
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}
Try it out:
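## Sample data (exact draws depend on the R version: R >= 3.6.0 changed
## the default algorithm used by sample(), so set.seed(1) may give other values)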
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
round(prop.table(table(a)), 2)
# a
# a A b B c C d D e E f g h
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13
# i j
# 0.08 0.07
a
# [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
condenseMe(a)
# [1] "c" "d" "d" "e" "j" "h" "c" "h"
# [9] "g" "i" "g" "d" "f" "Other" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b"
# [25] "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "Other" "i" "i" "b"
# [41] "i" "j" "f" "d" "c" "h" "Other" "j"
# [49] "j" "c" "Other" "e" "f" "a" "a" "h"
# [57] "e" "c" "Other" "b"
Note, however, that if you are dealing with factors, you should convert them with as.character first.
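The conversion matters because assigning a value that is not an existing level to a factor produces NA (with a warning) rather than the new label. A minimal sketch of a factor-friendly wrapper follows; condenseFactor is a hypothetical name, not part of the original answer, and the state counts are made up for illustration:

```r
# Hypothetical helper (an assumption, not from the original answer):
# the same logic as condenseMe(), but it accepts a factor and returns one.
condenseFactor <- function(x, threshold = 0.02, newName = "Other") {
  x <- as.character(x)                                   # work on plain strings
  rare <- names(which(prop.table(table(x)) < threshold)) # labels below the cutoff
  x[x %in% rare] <- newName                              # relabel the rare values
  factor(x)                                              # rebuild the factor
}

# Made-up data: Alabama accounts for 1 of 100 observations (0.01 < 0.02)
f <- factor(c(rep("California", 60), rep("Florida", 39), "Alabama"))
res <- condenseFactor(f)
levels(res)
table(res)
```

Rebuilding with factor() also drops the levels that were emptied by the relabelling, so no separate droplevels() call is needed.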
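Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function, fct_lump(), which lumps together the levels of a factor according to different criteria.
The OP's request to lump together the levels below a threshold of 0.02 can be achieved with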
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
[1] c d d e j h c h g i g d
[13] f Other g h h a b h e g h b
[25] d e e g i f d e g c g a
[37] Other i i b i j f d c h Other j
[49] j c Other e f a a h e c Other b
Levels: a b c d e f g h i j Other
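Note that the sample data from the earlier answer are reused here for comparison.
The function offers further possibilities. With a negative n, for example, it keeps the 5 factor levels with the lowest frequencies and lumps the remaining levels together: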
forcats::fct_lump(a, n = -5)
[1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B Other Other Other Other Other Other Other Other Other E Other
[49] Other Other C Other Other Other Other Other Other Other A Other
Levels: A B C D E Other
The following seems to work, but it is ugly. Is there a more elegant solution?
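collapsecatetgory <- function(x, p) {
  # x must be a factor; p is the relative-frequency cutoff
  levels_len = length(levels(x))
  levels(x)[levels_len+1] = 'Other'  # register 'Other' as a valid level
  y = table(x)/length(x)             # relative frequency of each level
  y1 = as.vector(y)
  y2 = names(y)
  y2_len = length(y2)
  for (i in seq_len(y2_len)) {
    if (y1[i]<=p){
      x[x==y2[i]] = 'Other'          # relabel levels at or below the cutoff
    }
  }
  x <- droplevels(x)                 # drop the levels that are now empty
  x
}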
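One possibly tidier base R variant, offered as a sketch rather than a definitive answer: because assigning the same name to several levels of a factor merges them, the loop can be replaced by a single assignment to levels(x). The function name collapse_rare_levels and the toy data are hypothetical; the cutoff comparison (<= p) and the table(x)/length(x) frequencies mirror the function above.

```r
# Hypothetical helper (an assumption, not from the original post):
# merge all levels whose relative frequency is <= p into one level.
collapse_rare_levels <- function(x, p, other = "Other") {
  stopifnot(is.factor(x))
  freq <- table(x) / length(x)                # relative frequency per level
  rare <- names(freq)[freq <= p]              # levels at or below the cutoff
  levels(x)[levels(x) %in% rare] <- other     # duplicate names merge the levels
  x
}

# Made-up data: "c" and "d" each account for 1 of 100 observations
x <- factor(c(rep("a", 50), rep("b", 48), "c", "d"))
out <- collapse_rare_levels(x, p = 0.02)
levels(out)
table(out)
```

Renaming levels never introduces NA values and leaves no empty levels behind, so neither the extra 'Other' placeholder level nor droplevels() is required.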