I think this function is what you need:
outliersper <- function(x) {
  length(which(x > mean(x) + 3 * sd(x) | x < mean(x) - 3 * sd(x))) / length(x)
}
Sample data
# 3 outliers here
df <- data.frame(col = c(1000, 1000, 1000, runif(100)))
# apply the function
> outliersper(df$col)
[1] 0.02912621
Verification
> sum(abs(df$col - mean(df$col)) > 3 * sd(df$col))
[1] 3
> 3/length(df$col)
[1] 0.02912621
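
One caveat: if the column contains missing values, `mean()` and `sd()` return `NA`, every comparison becomes `NA`, and `which()` silently drops them, so `outliersper()` reports 0 rather than an error. Below is a minimal sketch of an NA-aware variant that is also applied to every numeric column of a data frame. The name `outliersper_na`, the argument `k`, and the example data frame `dat` are my own assumptions for illustration, not part of the answer above.

```r
# Hypothetical NA-aware variant of outliersper(); the name, the k argument
# and the NA handling are assumptions, not part of the original answer.
outliersper_na <- function(x, k = 3) {
  x <- x[!is.na(x)]                     # drop missing values first
  if (length(x) < 2) return(NA_real_)   # sd() is NA for fewer than 2 values
  mean(abs(x - mean(x)) > k * sd(x))    # share of points more than k SDs from the mean
}

# Illustrative data: column a repeats the example above, column b contains an NA
set.seed(1)
dat <- data.frame(
  a = c(1000, 1000, 1000, runif(100)),
  b = c(NA, rnorm(102))
)

# Apply the function to every numeric column
res <- sapply(dat[sapply(dat, is.numeric)], outliersper_na)
res
```

Column `a` should again come out as 3/103, and column `b` gets a proportion computed on its 102 non-missing values instead of a silent 0.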
Something like this, assuming x is a column in your data frame?
set.seed(321)
x <- rnorm(10000)
x[x > mean(x) + 3*sd(x) | x < mean(x) - 3*sd(x)]
[1] 3.135843 -3.006514 3.227549 -3.255502 3.065514 3.159309 -3.171849
[8] 3.215432 3.120442 3.352662 3.574360 3.424063 3.126673 -3.024961
[15] -3.153842 -3.263268 -3.032526 3.179344 -3.605372
To get the proportion of outliers (multiply by 100 for a percentage):
outli <- x[x > mean(x) + 3*sd(x) | x < mean(x) - 3*sd(x)]
length(outli) / length(x)
[1] 0.0019
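
As a rough sanity check on that number: for normally distributed data, about 0.27% of values are expected to lie more than 3 standard deviations from the mean, so 0.19% in a sample of 10000 is in the right range. A small sketch using `pnorm()` (the variable name `expected` is just for illustration):

```r
# Theoretical share of a normal distribution lying more than 3 SDs from the mean
expected <- 2 * pnorm(-3)
expected
# [1] 0.002699796

# Expected number of flagged points in a sample of 10000
expected * 10000
```

This compares the sample against the population mean and SD, whereas the code above uses the sample estimates, so the two figures will not match exactly.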
And to turn this into a function:
find_outlier <- function(x, num = 3) {
  mean(x > mean(x) + num * sd(x) | x < mean(x) - num * sd(x))
}
find_outlier(x)
[1] 0.0019
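
The reason `mean()` works here is that R treats `TRUE` as 1 and `FALSE` as 0, so the mean of a logical vector is the proportion of `TRUE` values. A short sketch that checks this against the count-based version and shows the effect of the `num` argument (the data are regenerated so the block runs on its own; `flagged` is an illustrative name):

```r
find_outlier <- function(x, num = 3) {
  mean(x > mean(x) + num * sd(x) | x < mean(x) - num * sd(x))
}

set.seed(321)
x <- rnorm(10000)

# Logical vector marking points more than 3 SDs from the mean
flagged <- x > mean(x) + 3 * sd(x) | x < mean(x) - 3 * sd(x)

# mean() of a logical vector equals the count of TRUE values divided by its length
all.equal(find_outlier(x), sum(flagged) / length(x))
# [1] TRUE

# A smaller multiplier gives a narrower band, so it flags at least as many points
find_outlier(x, num = 2)
find_outlier(x, num = 3)
```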