Pandas DataFrame - Removing outliers

Question

Given a pandas DataFrame, I want to exclude the rows that are outliers (absolute Z-score of 3 or more) based on one of its columns.

The DataFrame looks like this:

df.dtypes
_id                   object
_index                object
_score                object
_source.address       object
_source.district      object
_source.price        float64
_source.roomCount    float64
_source.size         float64
_type                 object
sort                  object
priceSquareMeter     float64
dtype: object

The line:

dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]

raises the following exception:

-------------------------------------------------------------------------    
TypeError                                 Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]

/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
   2239     """
   2240     a = np.asanyarray(a)
-> 2241     mns = a.mean(axis=axis)
   2242     sstd = a.std(axis=axis, ddof=ddof)
   2243     if axis and mns.ndim < a.ndim:

/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
     68             is_float16_result = True
     69 
---> 70     ret = umr_sum(arr, axis, dtype, out, keepdims)
     71     if isinstance(ret, mu.ndarray):
     72         ret = um.true_divide(

TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

and the return value of

np.isreal(df['_source.price']).all()

is

True

Why does the above exception occur, and how can I exclude the outliers?

Answer 1

Her*_*eer 6

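If you want to use the interquartile range (IQR) of the given dataset, you can flag every value that lies more than 1.5 × IQR below the first quartile or above the third quartile: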

def Remove_Outlier_Indices(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    trueList = ~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR)))
    return trueList
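
As a small worked illustration of the same IQR rule, here is a sketch that applies it to a single numeric column and drops the offending rows. The `toy` frame, its column names, and its values are made up for the example and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column.
toy = pd.DataFrame({
    "district": ["a", "b", "c", "d", "e"],
    "price": [1.0, 2.0, 3.0, 4.0, 100.0],
})

q1 = toy["price"].quantile(0.25)   # 2.0
q3 = toy["price"].quantile(0.75)   # 4.0
iqr = q3 - q1                      # 2.0
lower = q1 - 1.5 * iqr             # -1.0
upper = q3 + 1.5 * iqr             # 7.0

# Boolean Series: True where the price lies inside [lower, upper].
keep = toy["price"].between(lower, upper)

# Indexing with a boolean Series drops whole rows,
# so the row with price 100.0 is removed.
toy_filtered = toy[keep]
```

Because the mask is built from one column only, the non-numeric column never takes part in the arithmetic.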

Using the Remove_Outlier_Indices function above, the non-outlier subset of a dataset can be obtained from its own quartiles:

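import numpy as np
import pandas as pd

# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})

# Boolean Mask of Non-Outliers (same shape as df)
nonOutlierList = Remove_Outlier_Indices(df)

# Non-Outlier Subset of the Given Dataset
# (outliers become NaN here; append .dropna() to drop those rows)
dfSubset = df[nonOutlierList]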
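
For the Z-score criterion asked about in the question, the traceback is consistent with `stats.zscore` being handed the whole frame, object columns included, so NumPy ends up trying to sum non-numeric values. In addition, the `axis` argument of `.all` expects 0 or 1 rather than a column name. A minimal sketch that restricts the computation to one numeric column might look like the following; the `demo` frame and its values are invented for illustration, and only the column names mirror the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 prices around 100 plus one planted outlier.
rng = np.random.default_rng(0)
prices = rng.normal(loc=100.0, scale=10.0, size=200)
prices[0] = 1000.0

demo = pd.DataFrame({
    "_source.district": ["x"] * 200,   # object column, left out of the math
    "_source.price": prices,
})

# Z-score of the single numeric column.
# ddof=0 matches the default of scipy.stats.zscore.
col = demo["_source.price"]
z = (col - col.mean()) / col.std(ddof=0)

# Keep rows whose absolute Z-score is below 3.
demo_clean = demo[z.abs() < 3]
```

Rows where the price is missing get a NaN Z-score, which compares as False, so they are dropped as well; whether that is desirable depends on the data.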
