Excluding outliers in R based on multiple columns? IQR method

M_O*_*ord 1 r outliers iqr

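I am currently trying to exclude outliers based on a subset of selected variables, with the aim of running a sensitivity analysis. I have adapted the function available here: "Calculating outliers in R", but so far without success (I am still a novice R user). If you have any suggestions, please let me know!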

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,   1006,   1007,   1008,   1009,   1010,   1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))

vars_of_interest <- c("measure1", "measure3", "measure4")

# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

# use the function to identify outliers
temp <- FindOutliers(df[vars_of_interest])

# remove the outliers
testData <- testData[-temp]

# show the data with the outliers removed
testData

asa*_*ica 6

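Separate the concerns:

  1. Identify outliers in a numeric vector using the IQR method. This can be encapsulated in a function that takes a vector.
  2. Remove outliers from several columns of a data.frame. This is a function that takes a data.frame.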

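I suggest returning a logical (boolean) vector rather than indices. That way the returned value has the same length as the data, which makes it easy to create a new column, e.g. df$outlier <- is_outlier(df$measure1)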
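To illustrate the point about logical vectors, here is a minimal sketch. The toy data and the helper name `is_outlier_sketch` are assumptions made up for this example (a compact stand-in for the function defined further down), not part of the original post.

```r
# Hypothetical toy data: 100 is an obvious extreme value
toy <- data.frame(ID = 1:6, measure1 = c(1, 2, 3, 4, 5, 100))

# Compact stand-in for the IQR rule (3 * IQR beyond the quartiles)
is_outlier_sketch <- function(x) {
  qs <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

flags <- is_outlier_sketch(toy$measure1)  # logical, one value per row
toy$outlier <- flags                      # fits directly into a new column
idx <- which(flags)                       # indices are still one call away
```

A logical vector can always be turned into indices with `which()`, so nothing is lost by returning the flags.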

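Note how the parameter names make it clear which type of input is expected: x is the standard name for a numeric vector, and df is obviously a data.frame. cols is presumably a list or vector of column names.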

I deliberately used only base R, but in real life I would use the dplyr package to manipulate the data.frame.
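For comparison, a dplyr version might look roughly like the sketch below. It is not code from the original answer: it assumes dplyr 1.0.4 or later (for `if_any()`), and the toy data and the name `is_outlier_sketch` are hypothetical.

```r
library(dplyr)

# Compact stand-in for the IQR rule (3 * IQR beyond the quartiles)
is_outlier_sketch <- function(x) {
  qs <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

# Hypothetical toy data: only row 6 holds an extreme value
toy <- data.frame(
  ID = 1:6,
  measure1 = c(1, 2, 3, 4, 5, 100),
  measure2 = c(10, 11, 12, 13, 14, 15)
)
cols <- c("measure1", "measure2")

# Keep rows where none of the selected columns is flagged
filtered <- toy %>%
  filter(!if_any(all_of(cols), is_outlier_sketch))
```

One design difference worth noting: here every column is evaluated on the full data, whereas a column-by-column loop recomputes the quartiles on rows that survived the earlier columns.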

#' Detect outliers using IQR method
#' 
#' @param x A numeric vector
#' @param na.rm Whether to exclude NAs when computing quantiles
#' 
is_outlier <- function(x, na.rm = FALSE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

#' Remove rows with outliers in given columns
#' 
#' Any row with at least 1 outlier will be removed
#' 
#' @param df A data.frame
#' @param cols Names of the columns of interest. Defaults to all columns.
#' 
#' 
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}
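One thing to watch for is missing values: `quantile()` stops with an error when the input contains NA and na.rm is FALSE, and an NA flag used as a row index produces a row of NAs. The following is a hedged sketch of one way to handle this. The names `is_outlier_sketch` and `remove_outliers_na`, the default `na.rm = TRUE`, and the choice to keep rows with NA are all assumptions for illustration.

```r
# Compact stand-in for the IQR rule, with na.rm passed on to quantile()
is_outlier_sketch <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm, names = FALSE)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

# Hypothetical variant: forwards na.rm and keeps rows whose value is NA
remove_outliers_na <- function(df, cols = names(df), na.rm = TRUE) {
  for (col in cols) {
    flag <- is_outlier_sketch(df[[col]], na.rm = na.rm)
    # `flag %in% TRUE` is FALSE for NA, so NA rows are kept, not dropped
    df <- df[!(flag %in% TRUE), , drop = FALSE]
  }
  df
}

# Hypothetical toy data with one NA and one extreme value
toy <- data.frame(ID = 1:7, measure1 = c(1, 2, 3, NA, 4, 5, 100))
cleaned <- remove_outliers_na(toy, "measure1")
```

Whether rows with missing values should be kept or dropped depends on the analysis; this sketch keeps them.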

With these two functions, the rest becomes very simple: pass the data.frame and the columns of interest to remove_outliers, for example remove_outliers(df, vars_of_interest) with the objects from the question.
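As a hedged usage sketch on data shaped like the question's, the two functions are restated compactly below so the block runs on its own. The seed and the injected extreme value are assumptions added for illustration and are not part of the original post.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

df <- data.frame(ID = 1001:1011,
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))

# Hypothetical extreme value, injected so there is something to remove
df$measure3[5] <- 500

vars_of_interest <- c("measure1", "measure3", "measure4")

# Flag values more than 3 * IQR beyond the quartiles
is_outlier <- function(x, na.rm = FALSE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
}

# Drop every row flagged in at least one of the given columns
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!is_outlier(df[[col]]), ]
  }
  df
}

df_filtered <- remove_outliers(df, vars_of_interest)

# IDs of the rows that were dropped
setdiff(df$ID, df_filtered$ID)
```

Because measure2 is not in vars_of_interest, extreme values in that column do not cause a row to be removed.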

Created on 2020-03-23 by the reprex package (v0.3.0)