R - Remove all outliers from a data set

Question

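I want to write a function that removes all outliers from a data set. I have read a lot of Stack Overflow posts about this, so I am aware of the dangers of removing outliers. Here is what I have so far: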

# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  a<-df[,sapply(df, is.numeric)]
  b<-df[,sapply(df, !is.numeric)]
  a<-lapply(a,function(x) remove_outliers(x))
  d<-merge(a,b)
  d
}


I know this has some problems, so please correct me if anything could be handled better.

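- !is.numeric() isn't a thing; how should I accomplish this?
  - I have also tried is.numeric==FALSE
- is.numeric() converts factors to integers. How do I prevent this?
- Am I using lapply correctly?
- Is there a better/easier way to apply remove_outliers than splitting the data set, running it on the numeric part, and then merging the result back with the non-numeric part?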

Answer 1

raw*_*awr 8

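Factors are stored as integer codes, but they are not plain (atomic) integer vectors: is.numeric() returns FALSE for a factor, so factor columns are not picked up as numeric and are not converted.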
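
A minimal sketch of that point, using a made-up factor `f` (the name and values are illustrative only): the underlying storage is integer, yet `is.numeric()` does not treat the factor as numeric.

```r
# Hypothetical factor whose labels happen to look like numbers
f <- factor(c("10", "20", "20"))

storage   <- typeof(f)       # how the factor is stored internally
numeric_p <- is.numeric(f)   # whether R regards it as numeric
codes     <- as.integer(f)   # the level codes, not the labels
values    <- as.numeric(as.character(f))  # the labels converted to numbers

print(storage)    # "integer"
print(numeric_p)  # FALSE
print(codes)      # 1 2 2
print(values)     # 10 20 20
```

So a column filter based on `is.numeric` leaves factor columns alone; the integer codes only surface if the factor is coerced explicitly.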

# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
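
As a quick sanity check of the 1.5 * IQR rule used by remove_outliers, here is a small sketch on a made-up vector (the values are illustrative, not taken from the question). The function is repeated so the snippet runs on its own.

```r
# remove_outliers as defined above, repeated so this snippet is self-contained
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up values: the quartiles are 2 and 4, so IQR = 2 and H = 3,
# which puts the fences at -1 and 7
x <- c(1, 2, 3, 4, 100)
cleaned <- remove_outliers(x)

# Only 100 lies outside the fences, so it is the only value set to NA
print(cleaned)
```

Note that the vector keeps its length; outliers are replaced by NA rather than dropped, which is what lets the result be written back into a data frame column.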


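You can replace columns by index, so you don't need to create separate data sets. Just make sure the data you pass to lapply is the same data you assign back into; for example, you don't want to do data[, 1:3] <- lapply(data, FUN), which I have done many times.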

# Removes all outliers from a data set
remove_all_outliers1 <- function(df){
  # We only want the numeric columns
  num <- sapply(df, is.numeric)
  # df[num] (list-style indexing) stays a data frame even if only one column is numeric
  df[num] <- lapply(df[num], remove_outliers)
  df
}


Similar to the above (and, I think, slightly easier), you can pass the entire data set to lapply. Also make sure you don't do

data <- lapply(data, if (x) something else anotherthing)


or

data[] <- lapply(data, if (x) something)


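These are also mistakes I have made many times: without the brackets, lapply hands back a plain list instead of a data frame, and without an else, every column that fails the condition comes back as NULL.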

remove_all_outliers2 <- function(df){
  df[] <- lapply(df, function(x) if (is.numeric(x))
    remove_outliers(x) else x)
  df
}

## test
mt <- within(mtcars, {
  mpg <- factor(mpg)
  gear <- letters[1:2]
})
head(mt)

identical(remove_all_outliers1(mt), remove_all_outliers2(mt))
# [1] TRUE
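
The two lapply pitfalls mentioned earlier can be seen on a small made-up data frame. This is only an illustrative sketch; `df`, its columns, and the doubling step are hypothetical stand-ins for a real transformation.

```r
# Hypothetical data: one numeric column and one character column
df <- data.frame(x = c(1, 2, 3), g = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Pitfall 1: assigning without brackets turns the data frame into a plain list
as_list <- lapply(df, function(col) if (is.numeric(col)) col * 2 else col)

# Pitfall 2: without an else branch, non-numeric columns come back as NULL
no_else <- lapply(df, function(col) if (is.numeric(col)) col * 2)

# The working pattern: df[] keeps the data frame, else passes columns through
kept <- df
kept[] <- lapply(df, function(col) if (is.numeric(col)) col * 2 else col)

print(is.data.frame(as_list))  # FALSE
print(is.null(no_else$g))      # TRUE
print(is.data.frame(kept))     # TRUE
```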


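Your idea can work with a few small tweaks. !is.numeric can be written as Negate(is.numeric), as the more verbose function(x) !is.numeric(x), or as !sapply(x, is.numeric). In general, applying one function (here the ! operator) directly to another function object does not work out of the box in R.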

# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns

  ## drop = FALSE in case only one column for either
  a<-df[,sapply(df, is.numeric), drop = FALSE]
  b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]

  ## note brackets
  a[]<-lapply(a, function(x) remove_outliers(x))

  ## stack them back together, not merge
  ## you could merge if you had a unique id, one id per row
  ## then make sure the columns are returned in the original order
  d<-cbind(a,b)
  d[, names(df)]
}

identical(remove_all_outliers2(mt), remove_all_outliers(mt))
# [1] TRUE
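
As a small check of the negation options discussed above, the sketch below compares them on a made-up data frame (`df` and its columns are hypothetical). It also confirms that passing `!is.numeric` directly raises an error, since `!` is applied to the function object itself.

```r
# Hypothetical data with numeric, factor, and character columns
df <- data.frame(n = c(1.5, 2.5), f = factor(c("a", "b")), s = c("x", "y"),
                 stringsAsFactors = FALSE)

# Three equivalent ways to flag the non-numeric columns
via_negate  <- sapply(df, Negate(is.numeric))
via_lambda  <- sapply(df, function(x) !is.numeric(x))
via_outside <- !sapply(df, is.numeric)

# The original attempt: ! cannot be applied to a function, so this errors
attempt <- tryCatch(sapply(df, !is.numeric), error = function(e) e)

print(via_negate)
print(identical(via_negate, via_lambda))   # TRUE
print(identical(via_negate, via_outside))  # TRUE
print(inherits(attempt, "error"))          # TRUE
```

Which form to use is mostly a matter of taste; negating the result of sapply computes the logical index once and flips it, which is handy when the numeric index is needed as well.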


Archived:	9 years, 8 months ago
Views:	6907
Last activity:	9 years, 8 months ago