[Rd] plot(<lm>): new behavior in R-2.2.0 alpha
Martin Maechler
maechler at stat.math.ethz.ch
Tue Sep 13 16:17:45 CEST 2005
As some of you R-devel readers may know, the plot() method for
"lm" objects is based in large part on contributions by John
Maindonald, subsequently "massaged" by me and other R-core
members.
In the statistics literature on applied regression, people have
had diverse opinions on what (and how many!) plots should be
used for goodness-of-fit / residual diagnostics, and to my
knowledge most people have agreed that they want to see one (or
more) versions of a Tukey-Anscombe plot {Residuals ~ Fitted} and
a QQ normal plot.
Another consideration was to be somewhat close to what S
(S-plus) was doing. Hence we have two versions of residuals vs
fitted: one for checking E[error] = 0, the other for checking
Var[error] = constant. That gave us the first three plots of
plot.lm(), which I don't want to debate at the moment
{though there's room for improvement even there: e.g., I know of at
least one case where plot(<lm>) wasn't used because the user
was missing the qqline() she was so used to in the QQ plot}.
The topic of this e-mail is the (default) 4th plot, which I have
changed, prompted by the following:
More than three months ago, John wrote
http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
(which became a thread of about 20 messages, from Apr.23 -- 29, 2005)
and currently,
NEWS for R 2.2.0 alpha contains
>> USER-VISIBLE CHANGES
>>
>> o plot(<lm object>) uses a new default for the fourth panel when
>> 'which' is not specified.
>> ___ may change before release ___
and the header is
plot.lm <-
    function (x, which = c(1:3, 5),
              caption = c("Residuals vs Fitted",
                          "Normal Q-Q", "Scale-Location",
                          "Cook's distance", "Residuals vs Leverage",
                          "Cook's distance vs Leverage"),
              ......... ) {..............}
So we now have 6 possible plots, where 1, 2, 3 and 5 are the
defaults (and 1, 2, 3, 4 were the old defaults).
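For those who want to compare the two sets of panels side by side,
here is a minimal sketch. The data set (the built-in 'cars') and the
2 x 2 layout are my own choices for illustration only; the 'which'
values are the ones from the header above.

```r
fit <- lm(dist ~ speed, data = cars)   # any fitted "lm" object will do

op <- par(mfrow = c(2, 2))             # 2 x 2 grid, one page per call
plot(fit, which = c(1:3, 5))           # the new default panels, written out
plot(fit, which = 1:4)                 # the previous default panels
par(op)                                # restore the graphics settings
```

Spelling out 'which' makes the result independent of whichever
default ends up in the release.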
For influential points, and for the combination of 'influential' and
'outlier', there have been quite a few more proposals in the past.
R <= 2.1.x has been plotting the Cook's distances vs. observation
number. However, quite a few people have noted that, all influence
measures being more or less complicated functions of residuals
and "hat values" aka "leverages", (R_i, h_{ii}), it would really
make sense, and fit better with the other plots,
to plot residuals vs. leverages, with the additional idea of
adding *contours* of (equal) Cook's distances to that plot, in
case one would really want to see them.
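To make the "function of (R_i, h_{ii})" statement concrete, here is a
small sketch. It assumes that R_i denotes the (internally)
standardized residuals as returned by rstandard(), and that p is the
number of estimated coefficients; under those assumptions
CD_i = R_i^2 * h_ii / (p * (1 - h_ii)). The 'cars' data are used only
for illustration, and the final plot() call is a bare residuals vs
leverage picture without contours, not the plot.lm() panel itself.

```r
# Simple regression on a built-in data set (chosen only for illustration).
fit <- lm(dist ~ speed, data = cars)

h <- hatvalues(fit)        # leverages h_ii
r <- rstandard(fit)        # internally standardized residuals R_i
p <- length(coef(fit))     # number of estimated coefficients

# Cook's distance written purely in terms of (R_i, h_ii)
cd_manual <- r^2 * h / ((1 - h) * p)

# Agrees with the built-in computation up to floating-point error
stopifnot(isTRUE(all.equal(cd_manual, cooks.distance(fit))))

# Bare residuals vs leverage plot (no Cook's distance contours)
plot(h, r, xlab = "Leverage", ylab = "Standardized residuals")
```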
In the meantime, this has been *active* in R-devel for quite a
while, and we haven't received any new comments.
One remaining problem I'd like to address is the "balanced AOV"
situation, something probably pretty rare nowadays in real
practice, but common of course in teaching ANOVA.
As you may remember, in a balanced design, all observations have
the same leverage h_{ii}, and the plot R_i vs h_{ii} is really
not so useful. In that case, the Cook's distances are
CD_i = c * R_i^2 for a constant c, and so CD_i vs i {the old
"4th plot in plot.lm"} is graphically identical to R_i^2 vs i.
Now in that case (of identical h_{ii}'s), I think one would really
want "R_i vs i".
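A quick numerical check of the balanced case, as a sketch only: the
data are simulated, and I again assume that R_i means the
standardized residuals from rstandard(). With n observations and p
coefficients, constant leverage means h_ii = p/n, and the constant
then works out to c = 1 / (n - p).

```r
# Balanced one-way layout: 3 groups with 5 observations each (simulated).
set.seed(1)
d <- data.frame(g = gl(3, 5), y = rnorm(15))
fit <- lm(y ~ g, data = d)

h <- hatvalues(fit)
n <- nrow(d)
p <- length(coef(fit))

# All leverages are equal to p/n in this design
stopifnot(isTRUE(all.equal(unname(h), rep(p / n, n))))

# Hence CD_i = c * R_i^2 with c = 1 / (n - p)
r <- rstandard(fit)
stopifnot(isTRUE(all.equal(cooks.distance(fit), r^2 / (n - p))))
```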
Question to the interested parties:
    Should there be an automatism
    ``when h_ii == const'' {"==" with a bit of numerical fuzz}
        plot  a) R_i vs i
        or    b) CD_i vs i
    or should users have to manually use
        plot(<lm>, which=1:4, ...)
    in such a case?
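To show what such an automatism could look like, here is a rough
sketch, not a proposal for the actual implementation. The function
names is_const_leverage() and fourth_panel() and the tolerance are
hypothetical. Since "R_i vs i" does not correspond to any of the six
'which' values in the header above, the sketch falls back to option
b), i.e. which = 4.

```r
# Hypothetical helper (not part of plot.lm): TRUE when all leverages agree
# up to a relative tolerance; 'tol' is an arbitrary illustrative value.
is_const_leverage <- function(fit, tol = 1e-10) {
    h <- hatvalues(fit)
    diff(range(h)) <= tol * mean(h)
}

# Hypothetical chooser for the fourth panel: 4 = "Cook's distance" (option b)
# when the leverages are constant, otherwise 5 = "Residuals vs Leverage".
fourth_panel <- function(fit) if (is_const_leverage(fit)) 4L else 5L

set.seed(1)
bal <- lm(y ~ g, data = data.frame(g = gl(3, 5), y = rnorm(15)))  # balanced
unb <- lm(dist ~ speed, data = cars)                              # not balanced

plot(bal, which = c(1:3, fourth_panel(bal)))   # same as which = 1:4 here
```

Users who disagree with the automatic choice could still pass
'which' explicitly.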
Feedback very welcome,
particularly if you first look at the examples in help(plot.lm)
in *R-devel* aka R-2.2.0 alpha.
Martin Maechler, ETH Zurich