Pythonic way of detecting outliers in one-dimensional observation data

51 python statistics numpy matplotlib statsmodels

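For the given data, I want to set the outlier values (defined by a 95% confidence level, a 95% quantile function, or anything else that is required) to NaN. Below are the data and the code I am using right now. I would be glad if someone could explain this to me further.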

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

Joe*_*ton 121

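The problem with using `percentile` is that the points identified as outliers are a function of your sample size.

There are many ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

Here is an implementation for the N-dimensional case (from some code for a paper: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):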

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
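
As a side note on the `0.6745` constant: it is approximately the 75th percentile of the standard normal distribution, so for normally distributed data the MAD is about 0.6745 times the standard deviation, which puts the modified z-score on roughly the same scale as an ordinary z-score. The sketch below only illustrates that relationship with the standard library; `modified_z_scores` and the sample values are illustrative and not part of the answer's code, and a non-zero MAD is assumed.

```python
from statistics import NormalDist, median

# 75th percentile of the standard normal distribution (about 0.6745).
k = NormalDist().inv_cdf(0.75)

def modified_z_scores(values):
    """Return k * |x - median| / MAD for each value (assumes MAD != 0)."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    return [k * d / mad for d in abs_dev]

# Made-up readings: five similar values and one that is far away.
sample = [4.9, 5.0, 5.1, 5.2, 5.3, 9.0]
scores = modified_z_scores(sample)
flags = [s > 3.5 for s in scores]
print(round(k, 4), flags)
```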

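This is very similar to a previous answer of mine, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CT Zhu's answer) with a median-absolute-deviation (MAD) test for a variety of sample sizes: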

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
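        # sns.distplot is deprecated in recent seaborn releases; kdeplot + rugplot
        # draw the same density curve and rug marks
        sns.kdeplot(x=x, ax=ax)
        sns.rugplot(x=x, ax=ax)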
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

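Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points the larger the sample is, regardless of whether or not they are actually outliers.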
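
The same comparison can be made numerically instead of visually. The sketch below is only an illustration under stated assumptions: `make_sample` is a hypothetical helper that replaces the random draws with evenly spaced normal quantiles (so the run is repeatable), the MAD test is written for one-dimensional input only, and the three planted values are the ones used in the plots above.

```python
from statistics import NormalDist

import numpy as np

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def mad_based_outlier(points, thresh=3.5):
    # One-dimensional version of the MAD test (assumes a non-zero MAD).
    diff = np.abs(points - np.median(points))
    return 0.6745 * diff / np.median(diff) > thresh

def make_sample(num):
    # Deterministic stand-in for np.random.normal(0, 0.5, num - 3):
    # evenly spaced quantiles of N(0, 0.5), plus the three planted outliers.
    dist = NormalDist(0, 0.5)
    inliers = [dist.inv_cdf((i + 0.5) / (num - 3)) for i in range(num - 3)]
    return np.array(inliers + [-3.0, -10.0, 12.0])

counts = {}
for num in (50, 100, 1000):
    x = make_sample(num)
    counts[num] = (int(percentile_based_outlier(x).sum()),
                   int(mad_based_outlier(x).sum()))
    print(num, counts[num])  # (percentile count, MAD count)
```

With these inputs the percentile test flags a roughly fixed fraction of the points, so its count grows with the sample size, whereas the MAD count stays at the planted values.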

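  • @CTZhu - Good point, especially if the OP's data is, say, log-normally distributed. For vaguely symmetric distributions, deviations from a normal distribution shouldn't matter too much, but for strongly asymmetric distributions such as the log-normal, MAD is not a good choice. (Though you could always apply it in log space to get around this.) All of this just underlines that you should put some thought into whichever outlier test you choose. (3 upvotes)
  • @JoeKington You use the median everywhere, but `diff` is calculated as an L2 norm (`**2`); the median is the value that minimizes the L1 norm, whereas in the L2 norm the mean is the center; I would have expected that if you start with the median you stay in the L1 norm. Do you have any reason why `**2` would work better than absolute values when calculating `diff`? (3 upvotes)
  • Joe, +1, this is a great answer. Though I wonder whether the OP's data is always uniformly distributed (`random.rand()`), or whether it mostly follows some other distribution. If the data is always uniformly distributed, I am not sure about using `MAD`. (2 upvotes)
  • An alternative mirror of the PDF paper for @JoeKington: http://www.pdf-archive.com/2016/07/29/outlier-methods-external/outlier-methods-external.pdf (2 upvotes)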
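
On the `**2` question above: for one-dimensional input each row holds a single component, so the square root of the summed squares is just the absolute deviation from the median. The two only differ for multi-column input, where the expression becomes the Euclidean distance of each point from the per-column median. A minimal check with made-up values:

```python
import numpy as np

x = np.array([5.1, 4.8, 5.3, 12.0, 5.0])  # illustrative 1-D data

# The expression used in is_outlier / mad_based_outlier.
points = x[:, None]
median = np.median(points, axis=0)
l2_diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))

# Plain absolute deviation from the median.
l1_diff = np.abs(x - np.median(x))

same = bool(np.allclose(l2_diff, l1_diff))
print(same)
```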

小智 15

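Detecting outliers in one-dimensional data depends on its distribution

1 - Normal distribution:

  1. The data values are spread almost evenly over the expected range: in this case you can easily use any of the methods that involve the mean, such as an interval of 2 or 3 standard deviations (about 95% or 99.7%, respectively) for normally distributed data (central limit theorem and the sampling distribution of the sample mean). It is a highly effective method. It is explained in Khan Academy's Statistics and Probability material on sampling distributions.

Another option is a prediction interval, if you want an interval for the data points rather than for the mean.

  2. The data values are randomly distributed over a range: the mean may not be a fair representation of the data, because it is easily influenced by outliers (very small or very large values that are not typical of the data set). The median is another way to measure the center of a numerical data set.

    Median absolute deviation: a method that measures the distance of every point from the median in terms of the median distance. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, and it is also explained in Joe Kington's answer above.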
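
The mean-based rule from item 1 can be sketched as follows. This is only an illustration, assuming roughly normal data; `std_based_outlier` and `n_std` are hypothetical names, and the values are made up. As item 2 points out, the mean and standard deviation are themselves pulled toward extreme values, so this rule can miss outliers in small samples.

```python
import numpy as np

def std_based_outlier(data, n_std=3.0):
    """Flag points more than n_std standard deviations away from the mean."""
    data = np.asarray(data, dtype=float)
    z = np.abs(data - data.mean()) / data.std()
    return z > n_std

# Illustrative values: 21 evenly spaced readings between 4 and 6,
# plus one extreme reading.
data = np.append(np.linspace(4.0, 6.0, 21), 50.0)
mask = std_based_outlier(data)
print(data[mask])
```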

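2 - Symmetric distribution: again, the median absolute deviation is a good method, provided the z-score calculation and the threshold are changed accordingly.

Explanation: http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3 - Asymmetric distribution: double MAD. The double median absolute deviation is explained in the link above.

Attaching my Python code for reference: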

def is_outlier_doubleMAD(self,points):
    """
    FOR ASYMMETRIC DISTRIBUTION
    Returns : filtered array excluding the outliers

    Parameters : the actual data Points array

    Calculates median to divide data into 2 halves.(skew conditions handled)
    Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer) 
    Only difference being , the thresholds are now the median distance of the right and left median with the actual data median
    """

    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
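    # NOTE: the index-based split below assumes `points` is sorted in ascending order
    medianIndex = points.size // 2  # integer division: a float cannot be used as a slice index in Python 3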

    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1),0.000001)
    med_abs_deviation2 = max(np.median(diff2),0.000001)

    threshold1 = ((median-median1)/med_abs_deviation1)*3
    threshold2 = ((median2-median)/med_abs_deviation2)*3

    #if any threshold is 0 -> no outliers
    if threshold1==0:
        threshold1 = np.inf  # sys.maxint no longer exists in Python 3, and sys was never imported
    if threshold2==0:
        threshold2 = np.inf
    #multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    i = 0
    for data in modified_z_score1:
        if data < threshold1:
            filtered1.append(leftData[i])
        i += 1
    i = 0
    filtered2 = []
    for data in modified_z_score2:
        if data < threshold2:
            filtered2.append(rightData[i])
        i += 1

    filtered = filtered1 + filtered2
    return filtered

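  • In Python 3, it should be `medianIndex = int(points.size/2)`. Also, if I run the code and set a threshold to zero, it crashes with the message "name 'sys' is not defined". Lastly, the `self` in the function call never gets used. (2 upvotes)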

ser*_*eyf 12

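I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers. It gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance and supports asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I added it for consistency within this thread. I am not 100% sure it is necessary, but it makes the comparison apples-to-apples.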

def doubleMADsfromMedian(y,thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when 
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh


CT *_*Zhu 3

Use `np.percentile`, as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# or >=, <= for within 95%
data[(percentiles[0]<data) & (percentiles[1]>data)]

# set the outliers to np.nan
data[(percentiles[0]>data) | (percentiles[1]<data)] = np.nan
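
The same assignment works with any boolean outlier mask, not only the percentile one. A minimal sketch, assuming one-dimensional float data with a non-zero MAD; `mad_mask` is an illustrative 1-D rewrite of the MAD test from the answer above, and the values are made up:

```python
import numpy as np

def mad_mask(data, thresh=3.5):
    # 1-D MAD test: True where the modified z-score exceeds thresh.
    diff = np.abs(data - np.median(data))
    return 0.6745 * diff / np.median(diff) > thresh

# Illustrative readings with one obviously extreme value at the end.
data = np.array([5.1, 5.3, 5.2, 5.8, 5.5, 5.4, 5.9, 5.0, 5.6, 5.7, 42.0])

# np.where returns a new array and leaves the original data untouched;
# the data must be a float array, because NaN cannot be stored in an integer array.
cleaned = np.where(mad_mask(data), np.nan, data)
print(cleaned)
```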