Replace missing values in R with the mean or mode

use*_*102 3 r missing-data

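I have a large dataset made up of mixed data types (numeric, character, factor, ordered factor) with missing values. I am trying to write a for loop that replaces each missing value with the mean of its column if the column is numeric, or with the mode if the column is character/factor.

This is what I have so far: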

#fake array:
age<- c(5,8,10,12,NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
} else if (class(df_test[,var]=="character") {
        Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
} 
}

The "Mode" function is:

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1)
        xmode <- ">1 mode"
    return(xmode)
}

It seems to just ignore the statements without giving any error... I also tried handling the first part with an index:

## create an index of missing values
index <- which(is.na(df_test)[,1], arr.ind = TRUE)
## calculate the row means and "duplicate" them to assign to appropriate cells
df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]

But I get this error: "Error in colMeans(df_test, na.rm = TRUE) : 'x' must be numeric"

Does anyone know how to solve this?

Thank you very much for your help! -F

pet*_*ete 5

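If you just remove the obvious errors (the unbalanced brackets and parentheses, the missing column index in the assignments, and the `Mode()` result that was never assigned back), it works as expected: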

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}

# fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

print(df_test)

#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]),var] <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]) %in% c("character", "factor")) {
        df_test[is.na(df_test[,var]),var] <- Mode(df_test[,var], na.rm = TRUE)
    }
}

print(df_test)

#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode

I suggest using an editor with syntax highlighting and bracket matching, which makes these syntax errors much easier to find.
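
As a possible alternative to the loop, here is a minimal sketch that applies one imputation function to every column with `lapply()`. The helpers `first_mode()` and `impute_column()` are hypothetical names introduced only for this illustration. Two assumptions differ from the code above: ties are resolved by taking the first most frequent value instead of writing ">1 mode" into the data, and types are checked with `is.numeric()` / `is.factor()` rather than `class(x) == ...`, because `class()` returns two values for an ordered factor and that comparison can then fail inside `if()`. The last two lines also show one way to avoid the `colMeans()` error from the question, by restricting it to the numeric columns.

```r
# Hypothetical helper: most frequent non-missing value.
# Assumption: on ties, the value that appears first in x wins.
first_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill NAs with the column mean (numeric columns)
# or with the mode (character and factor columns, including ordered factors).
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.character(x) || is.factor(x)) {
    # as.character() so the same assignment works for factors and strings
    x[is.na(x)] <- as.character(first_mode(x))
  }
  x
}

# Same fake data as above
df_test <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("banana", "apple", "pear", "grape", NA),
  stringsAsFactors = FALSE
)

# df_imputed[] <- ... keeps the data.frame structure while replacing every column
df_imputed <- df_test
df_imputed[] <- lapply(df_test, impute_column)

# colMeans() only accepts numeric input, so select the numeric columns first
num_cols  <- vapply(df_test, is.numeric, logical(1))
col_means <- colMeans(df_test[num_cols], na.rm = TRUE)
```

With this tie rule, the missing entry in `b` becomes "banana" (every fruit occurs once, so the first one is used). Whether that is acceptable depends on the data; flagging ties as in the `Mode()` function above is equally defensible.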