R: factor levels, recode the rest to 'other'

ako*_*ako 11 r r-factor

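I use factors fairly infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am coding/collapsing categories with few observations into "other", and I am looking for a quick way to do that. I have a variable with perhaps 20 levels, but I am interested in collapsing a bunch of them into one.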

# 500 rows; the 100 sampled NAICS codes are recycled by data.frame() to fill them
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340',
                                    '621391','621399','621410','621420','621491','621492','621493',
                                    '621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace = TRUE))

Here are the levels I'm interested in, with their labels in a separate vector.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

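I could use the factor() call and enumerate all the levels, classifying a category as "other" every time it has few observations.

Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are coded correctly and everything else is recoded as other?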

kit*_*ith 6

I think the easiest way is to relabel all the naics values that are not in the top 8 with a special value.

data$naics[!(data$naics %in% top8)] = -99

Then you can use the "exclude" option when converting it to a factor:

factor(data$naics, exclude=-99)
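One thing to keep in mind with `exclude`: excluded values end up as `NA` in the resulting factor, not as an "other" level. If an explicit "other" level with the descriptive labels is wanted, a minimal base R sketch might look like the following. The toy vector and the shortened `keep` / `keep_desc` vectors are hypothetical stand-ins for `data$naics`, `top8` and `top8_desc`.

```r
# Toy data (hypothetical values, for illustration only)
naics <- c('621111', '621999', '621210', '621491', '621111', '621320')

# Codes to keep and their labels (a shortened stand-in for top8 / top8_desc)
keep      <- c('621111', '621210', '621320')
keep_desc <- c('Offices of physicians', 'Offices of dentists', 'Offices of optometrists')

# Work on a character copy so this also works if naics is already a factor
codes <- as.character(naics)
codes[!(codes %in% keep)] <- 'other'

# Declare the factor: levels fix the order, labels attach the descriptions
naics_f <- factor(codes,
                  levels = c(keep, 'other'),
                  labels = c(keep_desc, 'other'))

table(naics_f)
```

Because `levels` and `labels` are matched by position, the two vectors must be in the same order, as `top8` and `top8_desc` are in the question.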


sbh*_*bha 5

You can use forcats::fct_other():

library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')

Or use fct_other() as part of a dplyr::mutate():

library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 

data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other

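Note that if the other_level argument is not set, the other level defaults to 'Other' (capital 'O').

Conversely, if you only want to convert a few factor levels to 'other', you can use the drop argument instead: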

data %>%  
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
  head(10)

   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910
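If the set of levels to keep is not known in advance and should simply be the most frequent ones, forcats can pick them by count. The sketch below assumes forcats 0.5.0 or later, where `fct_lump_n()` is available; the toy factor is hypothetical and not taken from the question's data.

```r
library(forcats)

# Toy factor (hypothetical values): 621111 appears 3 times, 621210 twice,
# and the remaining two codes once each
naics <- factor(c('621111', '621111', '621111',
                  '621210', '621210',
                  '621320', '621999'))

# Keep the 2 most frequent levels and lump everything else into 'other'
lumped <- fct_lump_n(naics, n = 2, other_level = 'other')

table(lumped)
```

This differs from `fct_other()` in that the kept levels are derived from the data rather than passed in through `keep`.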

dplyr also has recode_factor(), where you can set the .default argument to other, but with a larger number of levels to recode, as in this example, it can get tedious:

data %>% 
   mutate(naics = recode_factor(naics,
                                `621111` = '621111', `621210` = '621210', `621399` = '621399',
                                `621610` = '621610', `621330` = '621330', `621310` = '621310',
                                `621511` = '621511', `621420` = '621420', `621320` = '621320', .default = 'other'))
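The typing can be avoided by building the old-to-new mapping as a named vector and splicing it into `recode_factor()` with `!!!`, a pattern shown in the dplyr documentation for `recode()`. The sketch below is only an illustration: the toy vector and the shortened `keep` / `keep_desc` vectors are hypothetical stand-ins for `data$naics`, `top8` and `top8_desc`. It also attaches the descriptive labels in the same step.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
naics <- c('621111', '621999', '621210', '621491')

# Shortened stand-ins for top8 / top8_desc
keep      <- c('621111', '621210')
keep_desc <- c('Offices of physicians', 'Offices of dentists')

# Named vector: names are the old codes, values are the new labels
lookup <- setNames(keep_desc, keep)

# Splice the mapping in; anything not named in lookup becomes 'other'
recoded <- recode_factor(naics, !!!lookup, .default = 'other')

recoded
```

To keep the codes themselves as levels, as in the hand-written call above, `setNames(keep, keep)` can be used as the lookup instead.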