[Rd] plot(<lm>): new behavior in R-2.2.0 alpha
John Fox
jfox at mcmaster.ca
Wed Sep 14 14:48:21 CEST 2005
Dear Martin,
> -----Original Message-----
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
> Sent: Wednesday, September 14, 2005 3:56 AM
> To: John Fox
> Cc: 'Martin Maechler'; 'Werner Stahel'; 'John Maindonald';
> R-devel at stat.math.ethz.ch
> Subject: Re: [Rd] plot(<lm>): new behavior in R-2.2.0 alpha
>
> Thank you, John, for
> Dear
> >>>>> "JohnF" == John Fox <jfox at mcmaster.ca>
> >>>>> on Tue, 13 Sep 2005 16:41:28 -0400 writes:
>
> JohnF> A couple of comments on the new plots (numbers 5 and 6):
> JohnF> Perhaps some more thought could be given to the
> JohnF> plotted contours for Cook's D (which are 0.5 and 1.0
> JohnF> in the example -- large Cook's Ds). A rule-of-thumb
> JohnF> cut-off for this example is 4/(n - p) = 4/(50 - 5) =
> JohnF> 0.089, and the discrepancy will grow with n.
>
> That's an interesting suggestion. Where does the 4/(n-p)
> come from? or put differently, should I better read in of
> your books? ;-)
I believe that I got this by transforming a cutoff suggested by Chatterjee
and Hadi for dffits to the Cook's D scale.
>
> Honestly, I'm so much a fan of R_i / h_ii that I didn't even
> know that.
>
> JohnF> I'm not terribly fond of number 6, since it seems
> JohnF> natural to me to think of the relationship among
> JohnF> these quantities as influence on coefficients =
> JohnF> leverage * outlyingness (which corresponds to 5);
> JohnF> also note how in the example, the labels for large
> JohnF> residuals overplot.
>
> I think John mainly proposed '6' because other proposed it as
> another good alternative. From the few examples I've looked
> at, I haven't got fond at all either.
>
> JohnF> Finally, your remarks about balanced data are cogent
> JohnF> and suggest going with 1:3 in this case (since R_i
> JohnF> vs. i is pretty redundant with the QQ plot).
>
> Ah, that's another, maybe better alternative to my proposal.
>
> One drawback of it is for situations where people do something
> like par(mfrow=c(2,2))
> before calling plot(<lm>) for several fitted lm models,
> assuming to fill one page for each of the plots.
Good point -- I do that myself in the Rcmdr package.
> and I think that's something I would have done always in such
> situations where several different models are fitted and compared.
>
> Maybe plot.lm() should "advance an empty frame" as soon as
> prod(par("mfrow")) >= 4
> in that case?
>
That seems a good idea, at least for the default behaviour. An unlikely
complication would occur if the user wanted to put the plots generated by
plot.lm() on a page along with other plots.
Regards,
John
> Martin
>
> JohnF> --------------------------------
> JohnF> John Fox
> JohnF> Department of Sociology
> JohnF> McMaster University
> JohnF> Hamilton, Ontario
> JohnF> Canada L8S 4M4
> JohnF> 905-525-9140x23604
> JohnF> http://socserv.mcmaster.ca/jfox
> JohnF> --------------------------------
>
> >> -----Original Message-----
> >> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
> >> Sent: Tuesday, September 13, 2005 9:18 AM
> >> To: R-devel at stat.math.ethz.ch
> >> Cc: John Maindonald; Werner Stahel; John Fox
> >> Subject: plot(<lm>): new behavior in R-2.2.0 alpha
> >>
> >> As some of you R-devel readers may know, the plot() method
> >> for "lm" objects is based in large parts on contributions by
> >> John Maindonald, subsequently "massaged" by me and other
> >> R-core members.
> >>
> >> In the statistics litterature on applied regression, people
> >> have had diverse oppinions on what (and how many!) plots
> >> should be used for goodness-of-fit / residual diagnostics,
> >> and to my knowledge most people have agreed to want to see
> >> one (or more) version of a Tukey-Anscombe plot {Residuals ~
> >> Fitted} and a QQ normal plot.
> >> Another consideration was to be somewhat close to what S
> >> (S-plus) was doing. So we have two versions of residuals vs
> >> fitted, one for checking E[error] = 0, the other for
> >> checking Var[error] = constant. So we got to the
> first three plots of
> >> plot.lm() about which I don't want to debate at the moment
> >> {though, there's room for improvement even there: e.g., I
> >> know of at least one case where plot(<lm>) wasn't used
> >> because the user was missing the qqline() she was so used to
> >> in the QQ plot}
> >>
> >> The topic of this e-mail is the (default) 4th plot which I
> >> had changed; really prompted by the following:
> >> More than three months ago, John wrote
> >> http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
> >> (which became a thread of about 20 messages, from Apr.23
> >> -- 29, 2005)
> >>
> >> and currently,
> >> NEWS for R 2.2.0 alpha contains
> >>
> >> >> USER-VISIBLE CHANGES
> >> >>
> >> >> o plot(<lm object>) uses a new default for the
> fourth panel when
> >> >> 'which' is not specified.
> >> >> ___ may change before release ___
> >>
> >> and the header is
> >>
> >> plot.lm <-
> >> function (x, which = c(1:3, 5),
> >> caption = c("Residuals vs Fitted",
> >> "Normal Q-Q", "Scale-Location",
> >> "Cook's distance", "Residuals vs Leverage",
> >> "Cook's distance vs Leverage"),
> >> ......... ) {..............}
> >>
> >> So we now have 6 possible plots, where 1,2,3 and 5 are the
> >> defaults (and 1,2,3,4 where the old defaults).
> >>
> >> For the influential points and combination of 'influential'
> >> and 'outlier'
> >> there have been quite a few more proposals in the past. R <=
> >> 2.1.x has been plotting the Cook's distances vs. observation
> >> number, whereas quite a few people in the past have noted
> >> that all influence measures being more or less complicated
> >> functions of residuals and "hat values" aka "leverages",
> >> (R_i, h_{ii}), it would really make sense and fit more to the
> >> other plots to plot residuals vs. Leverages --- with the
> >> additional idea of adding *contours* of (equal) Cook's
> >> distances to that plot, in case one would really want
> to seem them.
> >>
> >> In the mean time, this has been *active* in R-devel for quite
> >> a while, and we haven't received any new comments.
> >>
> >> One remaining problem I'd like to address is the "balanced AOV"
> >> situation, something probably pretty rare nowadays in real
> >> practice, but common of course in teaching ANOVA.
> >> As you may remember, in a balanced design, all observations
> >> have the same leverages h_{ii}, and the plot R_i vs h_ii
> >> is really not so useful. In that case, the cook distances
> >> CD_i = c * R_i ^2 and so CD_i vs i {the old "4-th plot in
> >> plot.lm"} is
> >> graphically identical to R_i^2 vs i.
> >> Now in that case (of identical h_ii's), I think one would
> >> really want "R_i vs i".
> >>
> >> Question to the interested parties:
> >>
> >> Should there be an automatism
> >> ``when h_ii == const'' {"==" with a bit of numerical fuzz}
> >> plot a) R_i vs i
> >> or b) CD_i vs i
> >>
> >> or should users have to manually use
> >> plot(<lm>, which=1:4, ...)
> >> in such a case?
> >>
> >> Feedback very welcome,
> >> particularly, you first look at the examples in help(plot.lm)
> >> in *R-devel* aka R-2.2.0 alpha.
> >>
> >> Martin Maechler, ETH Zurich
More information about the R-devel
mailing list