[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Wed Sep 14 14:48:21 CEST 2005

Dear Martin,

> -----Original Message-----
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
> Sent: Wednesday, September 14, 2005 3:56 AM
> To: John Fox
> Cc: 'Martin Maechler'; 'Werner Stahel'; 'John Maindonald'; 
> R-devel at stat.math.ethz.ch
> Subject: Re: [Rd] plot(<lm>): new behavior in R-2.2.0 alpha
> 
> Thank you, John, for
> Dear 
> >>>>> "JohnF" == John Fox <jfox at mcmaster.ca>
> >>>>>     on Tue, 13 Sep 2005 16:41:28 -0400 writes:
> 
>     JohnF> A couple of comments on the new plots (numbers 5 and 6):
>     JohnF> Perhaps some more thought could be given to the
>     JohnF> plotted contours for Cook's D (which are 0.5 and 1.0
>     JohnF> in the example -- large Cook's Ds). A rule-of-thumb
>     JohnF> cut-off for this example is 4/(n - p) = 4/(50 - 5) =
>     JohnF> 0.089, and the discrepancy will grow with n.
> 
> That's an interesting suggestion.  Where does the 4/(n-p) 
> come from? or put differently, should I better read in of 
> your books? ;-)

I believe that I got this by transforming a cutoff suggested by Chatterjee
and Hadi for dffits to the Cook's D scale.

> 
> Honestly, I'm so much a fan of R_i / h_ii that I didn't even 
> know that.
> 
>     JohnF> I'm not terribly fond of number 6, since it seems
>     JohnF> natural to me to think of the relationship among
>     JohnF> these quantities as influence on coefficients =
>     JohnF> leverage * outlyingness (which corresponds to 5);
>     JohnF> also note how in the example, the labels for large
>     JohnF> residuals overplot.
> 
> I think John mainly proposed '6' because other proposed it as 
> another good alternative.  From the few examples I've looked 
> at, I haven't got fond at all either.
> 
>     JohnF> Finally, your remarks about balanced data are cogent
>     JohnF> and suggest going with 1:3 in this case (since R_i
>     JohnF> vs. i is pretty redundant with the QQ plot).
> 
> Ah, that's another, maybe better alternative to my proposal.
> 
> One drawback of it is for situations where people do something
> like   	par(mfrow=c(2,2))
> before calling  plot(<lm>) for several fitted lm models, 
> assuming to fill one page for each of the plots.

Good point -- I do that myself in the Rcmdr package.

> and I think that's something I would have done always in such 
> situations where several different models are fitted and compared.
> 
> Maybe plot.lm() should "advance an empty frame" as soon as
>  prod(par("mfrow")) >= 4
> in that case?
> 

That seems a good idea, at least for the default behaviour. An unlikely
complication would occur if the user wanted to put the plots generated by
plot.lm() on a page along with other plots.

Regards,
 John

> Martin
> 
>     JohnF> --------------------------------
>     JohnF> John Fox
>     JohnF> Department of Sociology
>     JohnF> McMaster University
>     JohnF> Hamilton, Ontario
>     JohnF> Canada L8S 4M4
>     JohnF> 905-525-9140x23604
>     JohnF> http://socserv.mcmaster.ca/jfox 
>     JohnF> -------------------------------- 
> 
>     >> -----Original Message-----
>     >> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
>     >> Sent: Tuesday, September 13, 2005 9:18 AM
>     >> To: R-devel at stat.math.ethz.ch
>     >> Cc: John Maindonald; Werner Stahel; John Fox
>     >> Subject: plot(<lm>): new behavior in R-2.2.0 alpha
>     >> 
>     >> As some of you R-devel readers may know, the plot() method 
>     >> for "lm" objects is based in large parts on contributions by 
>     >> John Maindonald, subsequently "massaged" by me and other 
>     >> R-core members.
>     >> 
>     >> In the statistics litterature on applied regression, people 
>     >> have had  diverse oppinions on what (and how many!) plots 
>     >> should be used for goodness-of-fit / residual diagnostics, 
>     >> and to my knowledge most people have agreed to want to see 
>     >> one (or more) version of a Tukey-Anscombe plot {Residuals ~ 
>     >> Fitted} and a QQ normal plot.
>     >> Another consideration was to be somewhat close to what S
>     >> (S-plus) was doing.  So we have two versions of residuals vs 
>     >> fitted, one for checking  E[error] = 0, the other for 
>     >> checking Var[error] = constant.  So we got to the 
> first three plots of
>     >> plot.lm() about which I don't want to debate at the moment 
>     >> {though, there's room for improvement even there: e.g., I 
>     >> know of at  least one case where plot(<lm>) wasn't used 
>     >> because the user  was missing the qqline() she was so used to 
>     >> in the QQ plot}
>     >> 
>     >> The topic of this e-mail is the (default) 4th plot which I 
>     >> had changed; really prompted by the following:
>     >> More than three months ago, John wrote
>     >> http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
>     >> (which became a thread of about 20 messages, from Apr.23 
>     >> -- 29, 2005)
>     >> 
>     >> and currently,
>     >> NEWS for R 2.2.0 alpha contains
>     >> 
>     >> >> USER-VISIBLE CHANGES
>     >> >> 
>     >> >>    o plot(<lm object>) uses a new default for the 
> fourth panel when
>     >> >>      'which' is not specified.
>     >> >>      ___ may change before release ___
>     >> 
>     >> and the header is
>     >> 
>     >> plot.lm <-
>     >> function (x, which = c(1:3, 5), 
>     >> caption = c("Residuals vs Fitted", 
>     >> "Normal Q-Q", "Scale-Location", 
>     >> "Cook's distance", "Residuals vs Leverage", 
>     >> "Cook's distance vs Leverage"), 
>     >> ......... ) {..............}
>     >> 
>     >> So we now have 6 possible plots, where 1,2,3 and 5 are the 
>     >> defaults (and 1,2,3,4 where the old defaults).
>     >> 
>     >> For the influential points and combination of 'influential' 
>     >> and 'outlier'
>     >> there have been quite a few more proposals in the past. R <= 
>     >> 2.1.x has been plotting the  Cook's distances vs. observation 
>     >> number, whereas quite a few people in the past have noted 
>     >> that all influence measures being more or less complicated 
>     >> functions of residuals and "hat values" aka "leverages", 
>     >> (R_i, h_{ii}), it would really make sense and fit more to the 
>     >> other plots to plot residuals vs. Leverages --- with the 
>     >> additional idea of adding *contours* of (equal) Cook's 
>     >> distances to that plot, in case one would really want 
> to seem them.
>     >> 
>     >> In the mean time, this has been *active* in R-devel for quite 
>     >> a while, and we haven't received any new comments.
>     >> 
>     >> One remaining problem I'd like to address is the "balanced AOV"
>     >> situation, something probably pretty rare nowadays in real 
>     >> practice, but common of course in teaching ANOVA.
>     >> As you may remember, in a balanced design, all observations 
>     >> have the same leverages h_{ii}, and the plot  R_i  vs  h_ii 
>     >> is really not so useful.  In that case,  the cook distances 
>     >> CD_i = c *  R_i ^2 and so  CD_i  vs  i {the old "4-th plot in 
>     >> plot.lm"} is
>     >> graphically identical to   R_i^2 vs i.
>     >> Now in that case (of identical h_ii's), I think one would 
>     >> really want  "R_i  vs  i".
>     >> 
>     >> Question to the interested parties:
>     >> 
>     >> Should there be an automatism
>     >> ``when h_ii == const''  {"==" with a bit of numerical fuzz}
>     >> plot a)  R_i   vs i
>     >> or   b)  CD_i  vs i
>     >> 
>     >> or should users have to manually use
>     >> plot(<lm>,  which=1:4, ...)
>     >> in such a case?
>     >> 
>     >> Feedback very welcome,
>     >> particularly, you first look at the examples in help(plot.lm) 
>     >> in *R-devel* aka R-2.2.0 alpha.
>     >> 
>     >> Martin Maechler, ETH Zurich