有条件地将数据点指向R中的置信带之外

D W*_*D W 7 statistics plot r linear-regression confidence-interval

我需要为下面的图中的置信带之外的数据点着色不同于带内的数据点.我应该在数据集中添加单独的列来记录数据点是否在置信带内吗?你能提供一个例子吗?

情节乐队情节

示例数据集:

## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html

## Disease severity as a function of temperature

# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)

# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)

## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))

## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)

# Take a look at the data
plot(
  diseasesev~temperature,
  data=severity,
  xlab="Temperature",
  ylab="% Disease Severity",
  pch=16,
  pty="s",
  xlim=c(0,30),
  ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
par(new=TRUE) # don't start a new plot

## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
                        lm(diseasesev~temperature),
                        xRange,
                        level=0.95,
                        interval="confidence"
                    )

## Plot lines derrived from best fit line and confidence band datapoints
matplot(
  xRange,
  pred4plot,
  lty=c(1,2,2),   #vector of line types and widths
  type="l",       #type of plot for each column of y
  xlim=c(0,30),
  ylim=c(0,30),
  xlab="",
  ylab=""
)
Run Code Online (Sandbox Code Playgroud)

Ian*_*ows 10

嗯,我认为这对ggplot2来说非常简单,但现在我意识到我不知道如何计算stat_smooth/geom_smooth的置信限.

考虑以下:

library(ggplot2)
pred <- as.data.frame(predict(severity.lm,level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature, 
    in_interval = diseasesev <=pred$upr & diseasesev >=pred$lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm')  + geom_point(aes(colour=in_interval)) +
    geom_line(aes(y=lwr),colour=I('red')) + geom_line(aes(y=upr),colour=I('red'))
Run Code Online (Sandbox Code Playgroud)

这会产生: alt文本http://ifellows.ucsd.edu/pmwiki/uploads/Main/strangeplot.jpg

我不明白为什么stat_smooth计算的置信带与直接从predict(即红线)计算的带不一致.任何人都可以对此有所了解吗?

编辑:

弄清楚了.ggplot2使用1.96*标准误差来绘制所有平滑方法的间隔.

pred <- as.data.frame(predict(severity.lm,se.fit=TRUE,
        level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature, 
    in_interval = diseasesev <=pred$fit.upr & diseasesev >=pred$fit.lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
    stat_smooth(method='lm')  + 
    geom_point(aes(colour=in_interval)) +
    geom_line(aes(y=fit.lwr),colour=I('red')) + 
    geom_line(aes(y=fit.upr),colour=I('red')) +
    geom_line(aes(y=fit.fit-1.96*se.fit),colour=I('green')) + 
    geom_line(aes(y=fit.fit+1.96*se.fit),colour=I('green'))
Run Code Online (Sandbox Code Playgroud)


Sha*_*pie 6

最简单的方法可能是计算一个TRUE/FALSE值向量,指示数据点是否在置信区间内.我将稍微重新调整您的示例,以便在执行绘图命令之前完成所有计算 - 这在程序逻辑中提供了一个清晰的分离,如果您将其中的一部分打包成一个函数,则可以利用它.

第一部分是几乎相同的,除了我替换成其他调用lm()内部predict()severity.lm可变没有必要使用额外的计算资源来重新计算线性模型时,我们已经将其存储:

## Dataset from 
#  apsnet.org/education/advancedplantpath/topics/
#    RModules/doc1/04_Linear_regression.html

## Disease severity as a function of temperature

# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)

# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)

## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))

## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)

## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
  severity.lm,
  xRange,
  level=0.95,
  interval="confidence"
)
Run Code Online (Sandbox Code Playgroud)

现在,我们将计算原始数据点的置信区间并运行测试以查看这些点是否在区间内:

modelConfInt <- predict(
  severity.lm,
  level = 0.95,
  interval = "confidence"
)

insideInterval <- modelConfInt[,'lwr'] < severity[['diseasesev']] &
  severity[['diseasesev']] < modelConfInt[,'upr']
Run Code Online (Sandbox Code Playgroud)

然后我们将完成绘图 - 首先是高级绘图功能plot(),正如您在示例中使用的那样,但我们只绘制区间内的点.然后我们将跟进低级函数points(),该函数将以不同的颜色绘制区间外的所有点.最后,matplot()将用于填写您使用它时的置信区间.然而,不是调用par(new=TRUE)我更喜欢将参数传递add=TRUE给高级函数,使它们像低级函数一样.

使用par(new=TRUE)就像玩一个肮脏的技巧一个绘图功能 - 这可能会产生无法预料的后果.该add参数由许多函数提供,以使它们向绘图添加信息而不是重绘它 - 我建议尽可能利用这个参数并par()作为最后的手段回退操作.

# Take a look at the data- those points inside the interval
plot(
  diseasesev~temperature,
  data=severity[ insideInterval,],
  xlab="Temperature",
  ylab="% Disease Severity",
  pch=16,
  pty="s",
  xlim=c(0,30),
  ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")

# Add points outside the interval, color differently
points(
  diseasesev~temperature,
  pch = 16,
  col = 'red',
  data = severity[ !insideInterval,]
)

# Add regression line and confidence intervals
matplot(
  xRange,
  pred4plot,
  lty=c(1,2,2),   #vector of line types and widths
  type="l",       #type of plot for each column of y
  add = TRUE
)
Run Code Online (Sandbox Code Playgroud)