R: How do I remove outliers from a smoother in ggplot2?

Joh*_*n 5 statistics r outliers ggplot2

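I have the following dataset that I am trying to plot with ggplot2. It is a time series of three experiments, A1, B1 and C1, each with three replicates.

I would like to add a stat that detects and removes outliers before returning a smoother (mean and variance?). I have written my own outlier function (not shown), but I expect there is already a function that does this and I just haven't found it.

I have looked at stat_sum_df("median_hilow", geom = "smooth") in some examples from the ggplot2 book, but I could not tell from the Hmisc help documentation whether it removes outliers.

Is there a function that removes outliers like this in ggplot, or where would I modify my code to add my own function?

EDIT: I just saw this (How to use an outlier test in R code) and noticed that Hadley recommends using a robust method such as rlm. I am plotting bacterial growth curves, so I don't think a linear model is the best choice, but any advice on other models, or on using robust models in this situation, would be appreciated.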

library (ggplot2)  

data = data.frame (day = c(1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7), od = 
c(
0.1,1.0,0.5,0.7
,0.13,0.33,0.54,0.76
,0.1,0.35,0.54,0.73
,1.3,1.5,1.75,1.7
,1.3,1.3,1.0,1.6
,1.7,1.6,1.75,1.7
,2.1,2.3,2.5,2.7
,2.5,2.6,2.6,2.8
,2.3,2.5,2.8,3.8), 
series_id = c(
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1"),
replicate = c(
"A1.1","A1.1","A1.1","A1.1",
"A1.2","A1.2","A1.2","A1.2",
"A1.3","A1.3","A1.3","A1.3",
"B1.1","B1.1","B1.1","B1.1",
"B1.2","B1.2","B1.2","B1.2",
"B1.3","B1.3","B1.3","B1.3",
"C1.1","C1.1","C1.1","C1.1",
"C1.2","C1.2","C1.2","C1.2",
"C1.3","C1.3","C1.3","C1.3"))

> data
   day   od series_id replicate
1    1 0.10        A1      A1.1
2    3 1.00        A1      A1.1
3    5 0.50        A1      A1.1
4    7 0.70        A1      A1.1
5    1 0.13        A1      A1.2
6    3 0.33        A1      A1.2
7    5 0.54        A1      A1.2
8    7 0.76        A1      A1.2
9    1 0.10        A1      A1.3
10   3 0.35        A1      A1.3
11   5 0.54        A1      A1.3
12   7 0.73        A1      A1.3
13   1 1.30        B1      B1.1
... etc...

This is what I have so far. It works well, but it does not remove outliers:

r <- ggplot(data = data, aes(x = day, y = od))
r + geom_point(aes(group = replicate, color = series_id)) + # add points
   geom_line(aes(group = replicate, color = series_id)) + # add lines
   geom_smooth(aes(group = series_id))  # add smoother, average of each replicate
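One place to hook in a custom outlier rule is the `data` argument of `geom_smooth()`: the points and lines can still show every reading, while the smoother only sees the rows that pass the filter. The sketch below assumes a hypothetical helper, `flag_outliers()`, which compares each reading with the median of its replicates on the same day and scales that deviation by a per-series MAD. The rule and the cutoff `k = 3` are illustrative assumptions, not the unpublished function mentioned above, and `method = "lm"` is used only because the example data has four distinct days.

```r
# Example data from the question, rebuilt compactly
growth <- data.frame(
  day = rep(c(1, 3, 5, 7), times = 9),
  od = c(0.1, 1.0, 0.5, 0.7,   0.13, 0.33, 0.54, 0.76,  0.1, 0.35, 0.54, 0.73,
         1.3, 1.5, 1.75, 1.7,  1.3, 1.3, 1.0, 1.6,      1.7, 1.6, 1.75, 1.7,
         2.1, 2.3, 2.5, 2.7,   2.5, 2.6, 2.6, 2.8,      2.3, 2.5, 2.8, 3.8),
  series_id = rep(c("A1", "B1", "C1"), each = 12),
  replicate = paste0(rep(c("A1", "B1", "C1"), each = 12), ".",
                     rep(rep(1:3, each = 4), times = 3))
)

# Hypothetical helper (an assumption, not a ggplot2 function):
# adds a logical `outlier` column to the data frame.
flag_outliers <- function(df, k = 3) {
  # median of the replicates of one series on one day
  centre <- ave(df$od, df$series_id, df$day, FUN = median)
  dev <- df$od - centre
  # one robust scale (MAD) per series, pooled over all days
  scale <- ave(dev, df$series_id, FUN = mad)
  # a zero scale means "cannot judge", so nothing is flagged in that series
  df$outlier <- scale > 0 & abs(dev) > k * scale
  df
}

flagged <- flag_outliers(growth)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # all readings are drawn; only the unflagged rows feed the smoother
  p <- ggplot(flagged, aes(x = day, y = od, color = series_id)) +
    geom_point(aes(shape = outlier)) +
    geom_line(aes(group = replicate)) +
    geom_smooth(data = subset(flagged, !outlier), aes(group = series_id),
                method = "lm", formula = y ~ x)
  # print(p) to draw the plot
}
```

On the example data this rule flags the 1.0 reading of A1.1 on day 3 and the 3.8 reading of C1.3 on day 7. Series B1 is left untouched because its pooled MAD is zero, which shows how crude such a rule is with only three replicates per day.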

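EDIT: I have just added two plots below that show examples of the outlier problem in my actual data, rather than in the example data above.

The first plot shows series p26s4. Around day 32 something very strange happened in two of the replicates, producing 2 outliers.

The second plot shows series p22s5. On day 18 something odd happened with that day's reading, which I think may have been a machine error.

At the moment I am eyeballing the data to check that the growth curves look OK. After taking Hadley's advice and setting family = "symmetric", I believe the loess smoother does a decent job of ignoring the outliers.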

p26s4 (two outliers around day 32, in two of the replicates): http://img696.imageshack.us/img696/8743/p26s4loess.png
p22s5 (odd reading on day 18, possibly a machine error): http://img521.imageshack.us/img521/8083/p22s5loess.png

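@Peter / @hadley, the next thing I would like to do is fit a logistic, Gompertz or Richards growth curve to this data instead of loess, and calculate the growth rate in the exponential phase. Eventually I plan to use the grofit package in R (http://cran.r-project.org/web/packages/grofit/index.html), but for now I would like to plot these curves manually with ggplot2. If you have any pointers, they would be much appreciated.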
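As a starting point for the logistic case, here is a minimal sketch using `nls()` with the self-starting model `SSlogis()` from the stats package. The data is simulated (the "true" parameters, noise level and sampling days are all assumptions), because the real growth curves are not shown here. Note that `nls()` is ordinary least squares, so it does not resist outliers by itself.

```r
# Simulated growth curve (assumed parameters, not real measurements)
set.seed(1)
day <- seq(0, 40, by = 2)
true_asym <- 3    # plateau OD
true_xmid <- 15   # day of the inflection point
true_scal <- 4    # time scale in days
od <- true_asym / (1 + exp((true_xmid - day) / true_scal)) +
  rnorm(length(day), sd = 0.05)
sim <- data.frame(day, od)

# Logistic fit; SSlogis() supplies its own starting values
fit <- nls(od ~ SSlogis(day, Asym, xmid, scal), data = sim)
est <- coef(fit)

# For this parameterisation the early (exponential) phase grows like
# exp(day / scal), so the specific growth rate is 1 / scal per day.
specific_rate <- 1 / est[["scal"]]
# The steepest absolute slope is reached at day = xmid.
max_slope <- est[["Asym"]] / (4 * est[["scal"]])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # se = FALSE because predict() for nls fits returns no standard errors
  p <- ggplot(sim, aes(x = day, y = od)) +
    geom_point() +
    geom_smooth(method = "nls",
                formula = y ~ SSlogis(x, Asym, xmid, scal),
                se = FALSE)
  # print(p) to draw the plot
}
```

The same pattern should carry over to a Gompertz curve with `SSgompertz()`, which has a different parameterisation, so the growth-rate formulas above would not apply unchanged.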

had*_*ley 14

Have you tried the family = "symmetric" argument to geom_smooth (which in turn gets passed on to loess)? This makes the loess smooth resistant to outliers.

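However, looking at your data, why do you think a linear fit is inadequate? You only have 4 x values, and there does not seem to be strong evidence of a departure from linearity.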
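If a straight line is adequate, a robust linear fit is one way to keep a single bad reading from dragging the line. The sketch below compares `lm()` with `MASS::rlm()` (Huber M-estimation by default) on series A1 from the example data; choosing A1 and inspecting the final weights are illustrative choices, not part of the original advice.

```r
# Series A1 from the example data; the 1.0 reading on day 3 looks suspicious
a1 <- data.frame(
  day = rep(c(1, 3, 5, 7), times = 3),
  od = c(0.1, 1.0, 0.5, 0.7,
         0.13, 0.33, 0.54, 0.76,
         0.1, 0.35, 0.54, 0.73)
)

ols <- lm(od ~ day, data = a1)
robust <- MASS::rlm(od ~ day, data = a1)  # MASS ships with R

# Coefficients side by side
rbind(lm = coef(ols), rlm = coef(robust))

# Final robustness weights: values below 1 mark down-weighted readings
round(robust$w, 2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_smooth() accepts a fitting function as `method`
  p <- ggplot(a1, aes(x = day, y = od)) +
    geom_point() +
    geom_smooth(method = MASS::rlm, formula = y ~ x)
  # print(p) to draw the plot
}
```

Here the reading in row 2 (od = 1.0 on day 3) receives the smallest weight, so it influences the robust line far less than it influences the least-squares line.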

  • Figured it out! The correct syntax is `geom_smooth(method = loess, method.args = list(family = "symmetric"))` (2 upvotes)
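To see what the family argument changes, here is a small sketch on simulated data (a straight line with one gross outlier; all values are assumptions chosen for illustration). It fits `loess()` directly with the default `family = "gaussian"` and with `family = "symmetric"`, then builds the matching ggplot2 layers using the `method.args` syntax from the comment above.

```r
# Simulated readings: a straight line plus one gross outlier at day 15
set.seed(42)
day <- 1:30
od <- 0.1 * day + rnorm(length(day), sd = 0.05)
od[15] <- od[15] + 5
sim <- data.frame(day, od)

plain <- loess(od ~ day, data = sim)                         # least squares
robust <- loess(od ~ day, data = sim, family = "symmetric")  # robust re-weighting

# Fitted values at the outlier; the underlying line is 0.1 * 15 = 1.5 there
at15 <- data.frame(day = 15)
fit_plain <- unname(predict(plain, at15))
fit_robust <- unname(predict(robust, at15))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sim, aes(x = day, y = od)) +
    geom_point() +
    # default loess, pulled towards the outlier
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
                colour = "grey50") +
    # robust loess, as in the comment above
    geom_smooth(method = "loess", formula = y ~ x,
                method.args = list(family = "symmetric"))
  # print(p) to draw the plot
}
```

Comparing `fit_plain` and `fit_robust` with 1.5 shows how far each smoother is pulled by the single bad reading.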