Tho*_*hew 5 python numpy matplotlib seaborn
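Plotting the "tips" dataset as a joint plot, I would like to label the top 10 outliers (or top n outliers) on the graph by their index in the "tips" dataframe. I compute the residuals (each point's distance from the fitted line) to find the outliers. Please ignore the merits of this outlier detection method. I just want to annotate the graph according to the spec.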
import pandas as pd
import seaborn as sns
sns.set(style="darkgrid", color_codes=True)
tips = sns.load_dataset("tips")
model = pd.ols(y=tips.tip, x=tips.total_bill)
tips['resid'] = model.resid
# indices to annotate
tips.sort_values(by=['resid'], ascending=[False]).head(5)
tips.sort_values(by=['resid'], ascending=[False]).tail(5)
%matplotlib inline
# newer seaborn needs x=/y= keywords and height= (formerly size=)
g = sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg",
                  xlim=(0, 60), ylim=(0, 12), color="r", height=7)
How can I annotate the top 10 outliers (the 5 largest and the 5 smallest residuals) on the graph with each point's index value?
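You can use matplotlib's annotate to create annotations for the points. The idea is to iterate over the dataframes and place an annotation at the position given by the "total_bill" and "tip" columns.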
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="darkgrid", color_codes=True)
tips = sns.load_dataset("tips")
model = pd.ols(y=tips.tip, x=tips.total_bill)
tips['resid'] = model.resid

# newer seaborn needs x=/y= keywords and height= (formerly size=)
g = sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg",
                  xlim=(0, 60), ylim=(0, 12), color="r", height=7)

# indices to annotate
head = tips.sort_values(by=['resid'], ascending=[False]).head(5)
tail = tips.sort_values(by=['resid'], ascending=[False]).tail(5)

def ann(row):
    # iterrows() yields (index, Series) pairs
    ind = row[0]
    r = row[1]
    plt.gca().annotate(ind, xy=(r["total_bill"], r["tip"]),
                       xytext=(2, 2), textcoords="offset points")

for row in head.iterrows():
    ann(row)
for row in tail.iterrows():
    ann(row)

plt.show()
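Since the question also asks about the top n outliers, the same idea can be wrapped in a small helper. This is only a sketch: the name annotate_extremes and its parameters are made up for illustration, and it assumes the dataframe already has a residual column. It takes the Axes explicitly instead of relying on plt.gca().

```python
import pandas as pd


def annotate_extremes(ax, df, x, y, resid="resid", n=5):
    """Label the n largest and n smallest residuals in df with their index.

    ax is any matplotlib Axes (for a seaborn jointplot, g.ax_joint).
    Returns the list of index labels that were annotated.
    """
    # pick the n rows at each end of the residual distribution
    extremes = pd.concat([df.nlargest(n, resid), df.nsmallest(n, resid)])
    for idx, row in extremes.iterrows():
        # place the index label 2 points up and right of the data point
        ax.annotate(str(idx), xy=(row[x], row[y]),
                    xytext=(2, 2), textcoords="offset points")
    return list(extremes.index)
```

With the joint plot from above, a call would presumably look like annotate_extremes(g.ax_joint, tips, "total_bill", "tip", n=5).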
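As of pandas 0.20, pandas.ols has been removed. To replace it, one can use the OLS model from statsmodels. The lines that fit the model and store the residuals would then read: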
import statsmodels.api as sm
model = sm.OLS(tips.tip, tips.total_bill)
tips['resid'] = model.fit().resid
Note that the result is slightly different (possibly due to different weighting).
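One likely contributor to the difference is the intercept: sm.OLS fits without a constant term unless a constant column is added (for example with sm.add_constant), whereas the removed pandas.ols included an intercept by default. The sketch below uses plain NumPy least squares on made-up data, not the tips dataset, to show how the residuals change between the two setups.

```python
import numpy as np

# Hypothetical toy data with a clear non-zero intercept: y = 2 + 0.5 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 0.5 * x

# Fit through the origin: only a slope, no intercept column
coef_no_const, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
resid_no_const = y - x * coef_no_const[0]

# Fit with an explicit column of ones, the role a constant column plays
X = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_const = y - X @ coef_const

print("residuals without intercept:", resid_no_const)
print("residuals with intercept:   ", resid_const)
```

With statsmodels, the counterpart would presumably be sm.OLS(tips.tip, sm.add_constant(tips.total_bill)), which should bring the residuals closer to those from the original pandas.ols call.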