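Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed, based on some assumed distribution of the points in d.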
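import numpy as np
def reject_outliers(data):
    m = 2  # rejection threshold, in standard deviations
    u = np.mean(data)
    s = np.std(data)
    # Keep only the elements within m standard deviations of the mean
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered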
>>> d = [2, 4, 5, 1, 6, 5, 40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2, 4, 5, 1, 6, 5]
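I say 'something like' because the function might allow for varying distributions (Poisson, Gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).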
Ben*_*ier 167
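When dealing with outliers, one should try to use estimators that are as robust as possible. The mean of a distribution is biased by outliers, whereas the median, for example, is affected much less.
Building on eumiro's answer: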
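def reject_outliers(data, m=2.):
    # Absolute distance of every point from the median
    d = np.abs(data - np.median(data))
    # Median of those distances (the median absolute deviation)
    mdev = np.median(d)
    # Distances in units of mdev; falls back to the scalar 0. when mdev is 0
    s = d / mdev if mdev else 0.
    return data[s < m]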
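Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scale the distances by their (again) median value so that m is on a reasonable relative scale.
Note that for the data[s<m] syntax to work, data must be a numpy array.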
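To make m comparable to a threshold expressed in standard deviations, the median absolute deviation can be multiplied by about 1.4826, the factor that turns it into a consistent estimate of σ for Gaussian data. The following is only a sketch under that Gaussian assumption; the name reject_outliers_mad, the default m=3.0, and the choice to return the data unchanged when the deviation is zero are illustrative and not part of the answer above.

```python
import numpy as np

def reject_outliers_mad(data, m=3.0):
    """Keep points within m robust standard deviations of the median.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    data = np.asarray(data, dtype=float)
    deviation = np.abs(data - np.median(data))
    # 1.4826 * MAD estimates sigma when the data are Gaussian (assumption)
    robust_sigma = 1.4826 * np.median(deviation)
    if robust_sigma == 0:
        # No usable scale estimate: return the data unchanged (design choice)
        return data
    return data[deviation < m * robust_sigma]

d = [2, 4, 5, 1, 6, 5, 40]
filtered_d = reject_outliers_mad(d)
print(filtered_d)
```

With the list from the question, only 40 is rejected, because every other point lies within 3 robust standard deviations of the median 5.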
eum*_*iro 87
This method is almost identical to yours, just more numpy-idiomatic (it also works on numpy arrays only):
def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
Yig*_*gal 12
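Benjamin Bannier's answer yields a pass-through when the median of the distances from the median is 0, so I found this modified version more helpful for cases like the one in the example below.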
def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]
Example:
data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))
Gives:
[[10 10 10 17 10 10]] # 17 is not filtered
[10 10 10 10 10] # 17 is filtered (its distance, 7, is greater than m)
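The extra pair of brackets in the first output has a simple cause. When mdev is 0, s becomes the scalar 0., so s < m is a single True rather than a boolean array, and indexing a numpy array with a scalar True returns every element under a new leading axis of length 1. A minimal sketch of that behavior, reusing data_points from the example (the other variable names are illustrative):

```python
import numpy as np

data_points = np.array([10, 10, 10, 17, 10, 10])

# A scalar boolean index is not a per-element mask:
# True keeps everything and adds a leading axis of length 1.
passed_through = data_points[True]
print(passed_through.shape)

# A boolean array of the same length is a real mask and filters elementwise.
distance = np.abs(data_points - np.median(data_points))
mask = distance < 2.0
masked = data_points[mask]
print(masked.shape)
```

The first shape is (1, 6), which prints as a nested array; the second is (5,), with 17 removed.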
Building on Benjamin's answer, using pandas.Series and replacing MAD with IQR:
def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    return sr[(sr - median).abs() <= iqr]
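For instance, if you set iq_range=0.6, the percentiles that bound the range become 0.20 <--> 0.80. The range is then wider, so more of the outlying points will be included in the result.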
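For plain numpy arrays, the same idea can be sketched with np.nanpercentile. This is an illustrative translation rather than the code from the answer above: the name reject_outliers_iqr is hypothetical, and the percentiles use numpy's default linear interpolation.

```python
import numpy as np

def reject_outliers_iqr(values, iq_range=0.5):
    """Keep points whose distance to the median is at most the percentile range.

    Hypothetical numpy counterpart of the pandas function; name is illustrative.
    """
    values = np.asarray(values, dtype=float)
    pcnt = (1 - iq_range) / 2
    # np.nanpercentile takes percentages (0-100) and ignores NaN entries
    qlow, median, qhigh = np.nanpercentile(
        values, [100 * pcnt, 50, 100 * (1 - pcnt)]
    )
    spread = qhigh - qlow
    # NaN never satisfies <=, so missing values are dropped from the result
    return values[np.abs(values - median) <= spread]

d = [2, 4, 5, 1, 6, 5, 40]
narrow = reject_outliers_iqr(d)                 # 25th to 75th percentile
wide = reject_outliers_iqr(d, iq_range=0.6)     # 20th to 80th percentile
print(narrow, wide)
```

On the list from the question, the wider range keeps the value 2 that the default range rejects, while 40 is rejected in both cases.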
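An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).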
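These two constants can also be reproduced without an online table. A small sketch, assuming Python 3.8 or later, where statistics.NormalDist is available in the standard library; the variable names are illustrative:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# z-score below which 90% / 95% of a standard normal distribution lies
z90 = standard_normal.inv_cdf(0.90)
z95 = standard_normal.inv_cdf(0.95)
print(round(z90, 3), round(z95, 3))  # prints 1.282 1.645
```

Because the 90th percentile of a Gaussian sits about 1.2815σ above its median, dividing the distance between those two values by that constant gives an estimate of σ.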
As a simple example:
import numpy as np
# Create some random numbers
x = np.random.normal(5, 2, 1000)
# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500
# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)
rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)
The output I get is:
Mean= 4.99760520022
Median= 4.95395274981
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649
Mean= 9.64760520022
Median= 4.95667658782
Max/Min= 2205.43861943 -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694
Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462
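Both estimates are close to the expected value of 2.
If we want to remove points more than 5 standard deviations above or below the median (with 1000 points we would expect about 1 value > 3 standard deviations):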
y = x[abs(x - p50) < rSig*5]
# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))
This gives:
Mean= 4.99755359935
Median= 4.95213030447
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.97692712883
I don't know which approach is the more efficient or robust.