How to find the most frequent value across multiple columns of a data frame

sho*_*ome 3 r dataframe

I have a data frame like this:

S A B C D E 
1 N N N N N
2 N Y Y N N
3 Y N Y N N
4 Y N Y Y Y

How can I create a new column F containing the character that occurs most often across columns A, B, C, D and E?

The output should look like this:

 S A B C D E F
 1 N N N N N N
 2 N Y Y N N N
 3 Y N Y N N N
 4 Y N Y Y Y Y

akr*_*run 5

We can create a Mode function (defined below) and apply it to each row

df1$F <- apply(df1[-1], 1, Mode)
df1
#  S A B C D E F
#1 1 N N N N N N
#2 2 N Y Y N N N
#3 3 Y N Y N N N
#4 4 Y N Y Y Y Y

Or another option is

df1$F <- c('N', 'Y')[max.col(table(c(row(df1[-1])), unlist(df1[-1])), 'first')]
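To make the `max.col` one-liner easier to follow, here is a step-by-step sketch of the same idea. The intermediate names (`vals`, `row_id`, `counts`, `best`, `F_new`) are introduced only for illustration, and taking the labels from `colnames(counts)` instead of the hard-coded `c('N', 'Y')` is a small variation, not part of the original answer.

```r
# Example data copied from the "Data" section of the answer
df1 <- data.frame(
  S = 1:4,
  A = c("N", "N", "Y", "Y"),
  B = c("N", "Y", "N", "N"),
  C = c("N", "Y", "Y", "Y"),
  D = c("N", "N", "N", "Y"),
  E = c("N", "N", "N", "Y"),
  stringsAsFactors = FALSE
)

vals <- df1[-1]                        # drop the id column S

# Row index of every cell, flattened column by column so it lines up with unlist()
row_id <- c(row(vals))

# Contingency table: one row per data-frame row, one column per distinct value
counts <- table(row_id, unlist(vals))

# Position of the largest count in each row; ties go to the first column
best <- max.col(counts, ties.method = "first")

# Look the labels up in the table itself rather than hard-coding c('N', 'Y')
F_new <- colnames(counts)[best]
F_new
```

Because `table()` sorts its levels, the hard-coded `c('N', 'Y')` in the one-liner only works while the data contains exactly those two values; the lookup via `colnames(counts)` is meant to relax that assumption.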

where Mode is defined as

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}
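As a rough illustration of what `Mode` does internally, the sketch below repeats the function with comments and traces it on a small vector. The vector `x` and the intermediate names `ux` and `idx` are made up for this example.

```r
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # the value with the largest count
}

x   <- c("N", "Y", "Y", "N", "N")
ux  <- unique(x)       # "N" "Y"
idx <- match(x, ux)    # 1 2 2 1 1  (position of each element in ux)
tabulate(idx)          # 3 2        (count per distinct value)

Mode(x)                # "N"
Mode(c("Y", "N"))      # tie: which.max() returns the first maximum, so "Y"
```

On ties the function returns whichever value appears first in the input, which may matter when the number of columns is even.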

Or using tidyverse

library(tidyverse)
df1 %>% 
    mutate(F = pmap_chr(.[-1], ~ Mode(c(...))))

Or another option is

gather(df1, key, F, - S) %>% 
     group_by(S, F) %>% 
     summarise(n = n()) %>% 
     slice(which.max(n)) %>% 
     ungroup %>% 
     dplyr::select(F) %>% 
     bind_cols(df1, .)

Or we transpose the dataset, apply Mode to each column, and bind the output as a new column to the original dataset

t(df1[-1]) %>%
   as.data.frame %>% 
   summarise_all(Mode) %>% 
   unlist %>%
   bind_cols(df1, F = .)

Or an option with data.table

library(data.table)
setDT(df1)[,  F := names(which.max(table(unlist(.SD)))), S][]

NOTE: These are general methods, not ones that only handle this particular case


If we need an efficient approach without any ifelse, we can also do this with

df1$F <- c("Y", "N")[(rowSums(df1[-1] == "N") > 2) + 1]
df1$F
#[1] "N" "N" "N" "Y"
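The `rowSums` shortcut relies on a logical vector being coerced to 0/1, so adding 1 turns it into an index into `c("Y", "N")`. The sketch below spells that out. It assumes the data holds only the two values "Y" and "N"; the names `vals`, `n_N`, `is_N_majority` and `F_fast` are illustrative, and the threshold `ncol(vals) / 2` is a generalization of the hard-coded `2`, not something taken from the original answer.

```r
# Example data copied from the "Data" section of the answer
df1 <- data.frame(
  S = 1:4,
  A = c("N", "N", "Y", "Y"),
  B = c("N", "Y", "N", "N"),
  C = c("N", "Y", "Y", "Y"),
  D = c("N", "N", "N", "Y"),
  E = c("N", "N", "N", "Y"),
  stringsAsFactors = FALSE
)

# Select the value columns by name so that an already-added F column is ignored
vals <- df1[c("A", "B", "C", "D", "E")]

n_N <- rowSums(vals == "N")               # number of "N" per row: 5 3 3 1
is_N_majority <- n_N > ncol(vals) / 2     # TRUE TRUE TRUE FALSE

# FALSE + 1 = 1 picks "Y", TRUE + 1 = 2 picks "N"
F_fast <- c("Y", "N")[is_N_majority + 1]
F_fast
```

With an even number of columns, a row split evenly between the two values would fall to "Y" under this threshold, so the tie rule should be chosen deliberately.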

Or with Reduce

c("Y", "N")[(Reduce(`+`, lapply(df1[-1], `==`, "N")) > 2) + 1]

Or another approach is

library(stringr)  # provides str_count(); also attached by library(tidyverse)
c("Y", "N")[(str_count(do.call(paste0, df1[-1]), "N") > 2) + 1]

data

df1 <- structure(list(S = 1:4, A = c("N", "N", "Y", "Y"), B = c("N", 
"Y", "N", "N"), C = c("N", "Y", "Y", "Y"), D = c("N", "N", "N", 
"Y"), E = c("N", "N", "N", "Y")), class = "data.frame", row.names = c(NA, 
-4L))