[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Wed Sep 14 10:55:48 CEST 2005

Thank you, John, for
Dear 
>>>>> "JohnF" == John Fox <jfox at mcmaster.ca>
>>>>>     on Tue, 13 Sep 2005 16:41:28 -0400 writes:

    JohnF> A couple of comments on the new plots (numbers 5 and 6):
    JohnF> Perhaps some more thought could be given to the
    JohnF> plotted contours for Cook's D (which are 0.5 and 1.0
    JohnF> in the example -- large Cook's Ds). A rule-of-thumb
    JohnF> cut-off for this example is 4/(n - p) = 4/(50 - 5) =
    JohnF> 0.089, and the discrepancy will grow with n.

That's an interesting suggestion.  Where does the 4/(n-p) come
from? or put differently, should I better read in of your books? ;-)

Honestly, I'm so much a fan of R_i / h_ii that I didn't even
know that.

    JohnF> I'm not terribly fond of number 6, since it seems
    JohnF> natural to me to think of the relationship among
    JohnF> these quantities as influence on coefficients =
    JohnF> leverage * outlyingness (which corresponds to 5);
    JohnF> also note how in the example, the labels for large
    JohnF> residuals overplot.

I think John mainly proposed '6' because other proposed it as another
good alternative.  From the few examples I've looked at, I
haven't got fond at all either.

    JohnF> Finally, your remarks about balanced data are cogent
    JohnF> and suggest going with 1:3 in this case (since R_i
    JohnF> vs. i is pretty redundant with the QQ plot).

Ah, that's another, maybe better alternative to my proposal.

One drawback of it is for situations where people do something
like   	par(mfrow=c(2,2))
before calling  plot(<lm>) for several fitted lm models,
assuming to fill one page for each of the plots.
and I think that's something I would have done always in such
situations where several different models are fitted and compared.

Maybe plot.lm() should "advance an empty frame" as soon as
 prod(par("mfrow")) >= 4 
in that case?

Martin

    JohnF> --------------------------------
    JohnF> John Fox
    JohnF> Department of Sociology
    JohnF> McMaster University
    JohnF> Hamilton, Ontario
    JohnF> Canada L8S 4M4
    JohnF> 905-525-9140x23604
    JohnF> http://socserv.mcmaster.ca/jfox 
    JohnF> -------------------------------- 

    >> -----Original Message-----
    >> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
    >> Sent: Tuesday, September 13, 2005 9:18 AM
    >> To: R-devel at stat.math.ethz.ch
    >> Cc: John Maindonald; Werner Stahel; John Fox
    >> Subject: plot(<lm>): new behavior in R-2.2.0 alpha
    >> 
    >> As some of you R-devel readers may know, the plot() method 
    >> for "lm" objects is based in large parts on contributions by 
    >> John Maindonald, subsequently "massaged" by me and other 
    >> R-core members.
    >> 
    >> In the statistics litterature on applied regression, people 
    >> have had  diverse oppinions on what (and how many!) plots 
    >> should be used for goodness-of-fit / residual diagnostics, 
    >> and to my knowledge most people have agreed to want to see 
    >> one (or more) version of a Tukey-Anscombe plot {Residuals ~ 
    >> Fitted} and a QQ normal plot.
    >> Another consideration was to be somewhat close to what S
    >> (S-plus) was doing.  So we have two versions of residuals vs 
    >> fitted, one for checking  E[error] = 0, the other for 
    >> checking Var[error] = constant.  So we got to the first three plots of
    >> plot.lm() about which I don't want to debate at the moment 
    >> {though, there's room for improvement even there: e.g., I 
    >> know of at  least one case where plot(<lm>) wasn't used 
    >> because the user  was missing the qqline() she was so used to 
    >> in the QQ plot}
    >> 
    >> The topic of this e-mail is the (default) 4th plot which I 
    >> had changed; really prompted by the following:
    >> More than three months ago, John wrote
    >> http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
    >> (which became a thread of about 20 messages, from Apr.23 
    >> -- 29, 2005)
    >> 
    >> and currently,
    >> NEWS for R 2.2.0 alpha contains
    >> 
    >> >> USER-VISIBLE CHANGES
    >> >> 
    >> >>    o plot(<lm object>) uses a new default for the fourth panel when
    >> >>      'which' is not specified.
    >> >>      ___ may change before release ___
    >> 
    >> and the header is
    >> 
    >> plot.lm <-
    >> function (x, which = c(1:3, 5), 
    >> caption = c("Residuals vs Fitted", 
    >> "Normal Q-Q", "Scale-Location", 
    >> "Cook's distance", "Residuals vs Leverage", 
    >> "Cook's distance vs Leverage"), 
    >> ......... ) {..............}
    >> 
    >> So we now have 6 possible plots, where 1,2,3 and 5 are the 
    >> defaults (and 1,2,3,4 where the old defaults).
    >> 
    >> For the influential points and combination of 'influential' 
    >> and 'outlier'
    >> there have been quite a few more proposals in the past. R <= 
    >> 2.1.x has been plotting the  Cook's distances vs. observation 
    >> number, whereas quite a few people in the past have noted 
    >> that all influence measures being more or less complicated 
    >> functions of residuals and "hat values" aka "leverages", 
    >> (R_i, h_{ii}), it would really make sense and fit more to the 
    >> other plots to plot residuals vs. Leverages --- with the 
    >> additional idea of adding *contours* of (equal) Cook's 
    >> distances to that plot, in case one would really want to seem them.
    >> 
    >> In the mean time, this has been *active* in R-devel for quite 
    >> a while, and we haven't received any new comments.
    >> 
    >> One remaining problem I'd like to address is the "balanced AOV"
    >> situation, something probably pretty rare nowadays in real 
    >> practice, but common of course in teaching ANOVA.
    >> As you may remember, in a balanced design, all observations 
    >> have the same leverages h_{ii}, and the plot  R_i  vs  h_ii 
    >> is really not so useful.  In that case,  the cook distances 
    >> CD_i = c *  R_i ^2 and so  CD_i  vs  i {the old "4-th plot in 
    >> plot.lm"} is
    >> graphically identical to   R_i^2 vs i.
    >> Now in that case (of identical h_ii's), I think one would 
    >> really want  "R_i  vs  i".
    >> 
    >> Question to the interested parties:
    >> 
    >> Should there be an automatism
    >> ``when h_ii == const''  {"==" with a bit of numerical fuzz}
    >> plot a)  R_i   vs i
    >> or   b)  CD_i  vs i
    >> 
    >> or should users have to manually use
    >> plot(<lm>,  which=1:4, ...)
    >> in such a case?
    >> 
    >> Feedback very welcome,
    >> particularly, you first look at the examples in help(plot.lm) 
    >> in *R-devel* aka R-2.2.0 alpha.
    >> 
    >> Martin Maechler, ETH Zurich