[R] Plot different regression models on one graph

Mon Feb 15 08:31:35 CET 2010

Dear Peter,

Ah, I see your point, Professor. The point at x=23.5 is carrying the model.
Allow me to clarify. I was making a similar point.

I was suggesting that the cube term could be left in the model, as you did,
but rather than dropping the data-point, another model is fit with an
additional parameter added to control for the suspected outlier -
essentially a point discontinuity.

Note that the call to poly() is not necessary for the graphical
representation, so I'll continue the example without it.
fm3<-lm(y~x+I(x^2)+I(x^3), data=dat)
fm3.c<-coef(fm3)

# drop data point alternative subseting using logical indexes.
index<-dat$x!=23.5
x2<-x[index]
x2<-dat$x[index]
y2<-dat$y[index]
fm4<-lm(y2~x2+I(x2^2)+I(x2^3), data=dat)

# controlling for the potential outlier in the model, without throwing out
the data. 
ctrl<-rep(0,length=dim(dat)[1])
ctrl[dat$x==23.5]<-resid(fm3)[[dat$x==23.5]
fm5<-lm(y~x+I(x^2)+I(x^3)+ctrl, data=dat)

# avoids the predict & lines calls, but requires knowledge of the model. 
curve(fm3.c[1]+fm3.c[2]*x+fm3.c[3]*x^2+fm3.c[4]*x^3, col="green", add=TRUE)
curve(fm4.c[1]+fm4.c[2]*x+fm4.c[3]*x^2+fm4.c[4]*x^3, col="green", add=TRUE)
# plot dropped outlier
curve(fm5.c[1]+fm5.c[2]*x+fm5.c[3]*x^2+fm5.c[4]*x^3, col="purple", add=TRUE)
# plot without ctrl variable

anova(fm5) 

Tells us how much difference one point made, sacrificing a df just to ensure
we're not throwing out information haphazardly. F>1 or whatever cutoff you
want to use. If it turns out that the point is important, then at least its
existence & effect was documented. 

There is some irony, I suppose, that the graphical representation of
dropping the point vs. adding a control variable, is equivalent for this
example. I have not decided how I feel about that yet, but I do have a
splitting headache. 

Sincerely,
KeithC.

-----Original Message-----
From: Peter Ehlers [mailto:ehlers at ucalgary.ca] 
Sent: Sunday, February 14, 2010 10:04 PM
To: kMan
Cc: r-help at r-project.org
Subject: Re: [R] Plot different regression models on one graph

kMan wrote:
> Peter wrote:
>> You like to live dangerously.
> 
> Clue me in, Professor.
> 
> Sincerely,
> KeithC.
> 

Okay, Keith, here goes:

dat <-
structure(list(x = c(62.5, 68.5, 0, 52, 0, 52, 0, 52, 23.5, 86, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), y = c(0.054, 0.055, 0.017, 0.021, 0.02, 0.028, 0.032,
0.073, 0.076, 0.087, 0.042, 0.042, 0.041, 0.045, 0.021, 0.018, 0.017, 0.018,
0.028, 0.022)), .Names = c("x", "y"), row.names = c(NA, -20L), class =
"data.frame")

fm1 <- lm(y ~ poly(x,3), data = dat)
fm2 <- lm(y ~ poly(x,3), data = dat, subset = -9)

xx <- 0:86
yhat1 <- predict(fm1, list(x = xx))
yhat2 <- predict(fm2, list(x = xx))

plot(x,y)
lines(xx, yhat1, col="blue", lwd=2)
lines(xx, yhat2, col="red", lwd=2)

That's how much difference *one* point makes in a cubic fit!
I'm not much of a gambler, so I wouldn't base any important decisions on the
results of the fit.

Cheers,

  -Peter Ehlers

> -----Original Message-----
> From: Peter Ehlers [mailto:ehlers at ucalgary.ca]
> Sent: Sunday, February 14, 2010 6:49 PM
> To: kMan
> Cc: 'David Winsemius'; 'Rhonda Reidy'; r-help at r-project.org
> Subject: Re: [R] Plot different regression models on one graph
> 
> kMan wrote:
>> I would use all of the data. If you want to "drop" one, control for 
>> it in the model & sacrifice a degree of freedom.
> 
> You like to live dangerously.
> 
>   -Peter Ehlers
> 
>> Why the call to poly() by the way?
>>
>> KeithC.
>>
>> -----Original Message-----
>> From: Peter Ehlers [mailto:ehlers at ucalgary.ca]
>> Sent: Saturday, February 13, 2010 1:35 PM
>> To: David Winsemius
>> Cc: Rhonda Reidy; r-help at r-project.org
>> Subject: Re: [R] Plot different regression models on one graph
>>
>> Rhonda:
>>
>> As David points out, a cubic fit is rather speculative. I think that 
>> one needs to distinguish two situations: 1) theoretical justification 
>> for a cubic model is available, or 2) you're dredging the data for the
"best"
> fit.
>> Your case is the second. That puts you in the realm of EDA 
>> (exploratory
> data
>> analysis). You're free to fit any model you wish, but you should 
>> assess
> the
>> value of the model. One convenient way is to use the 
>> influence.measures() function, which will show that points 9 and 10 
>> in your data have a strong influence on your cubic fit (as, of 
>> course, your eyes would tell you). A good thing to do at this point is to
fit 3 more cubic models:
>> 1) omitting point 9, 2) omitting point 10, 3) omitting both.
>>
>> Try it and plot the resulting fits. You may be surprised.
>>
>> Conclusion: Without more data, you can conclude nothing about a 
>> non-straightline fit.
>>
>> (And this hasn't even addressed the relative abundance of x=0 data.)
>>
>>   -Peter Ehlers
>>
>> David Winsemius wrote:
>>> On Feb 13, 2010, at 1:35 PM, Rhonda Reidy wrote:
>>>
>>>> The following variables have the following significant 
>>>> relationships (x is the explanatory variable): linear, cubic,
exponential, logistic.
>>>> The linear relationship plots without any trouble.
>>>>
>>>> Cubic is the 'best' model, but it is not plotting as a smooth curve 
>>>> using the following code:
>>>>
>>>> cubic.lm<- lm(y~poly(x,3))
>>> Try:
>>>
>>> lines(0:80, predict(cubic.lm, data.frame(x=0:80)),lwd=2)
>>>
>>> But I really must say the superiority of that relationship over a 
>>> linear one is far from convincing to my eyes. Seems to be caused by 
>>> one data point. I hope the quotes around "best" mean tha tyou have 
>>> the
>> same qualms.
>>>> lines(x,predict(cubic.lm),lwd=2)
>>>>
>>>> How do I plot the data and the estimated curves for all of these 
>>>> regression models in the same plot?
>>>>
>>>> x <- c(62.5,68.5,0,52,0,52,0,52,23.5,86,0,0,0,0,0,0,0,0,0,0)
>>>>
>>>> y <-
>>>> c(0.054,0.055,0.017,0.021,0.020,0.028,0.032,0.073,0.076,0.087,0.042
>>>> ,0
>>>> .042,0.041,0.045,0.021,0.018,0.017,0.018,0.028,0.022)
>>>>
>>>>
>>>> Thanks in advance.
>>>>
>>>> Rhonda Reidy
>>>>
>> --
>> Peter Ehlers
>> University of Calgary
>>
>>
>>